Preface


This book has been written for a first course in probability and was developed 
from lectures given at the University of Illinois during the last five years. 
Most of the students have been juniors, seniors, and beginning graduates, 
from the fields of mathematics, engineering and physics. The only formal 
prerequisite is calculus, but an additional degree of mathematical maturity 
may be helpful. 

In talking about nondiscrete probability spaces, it is difficult to avoid 
measure-theoretic concepts. However, to develop extensive formal machinery 
from measure theory before going into probability (as is done in most 
graduate programs in mathematics) would be inappropriate for the particular 
audience to whom the book is addressed. Thus I have tried to suggest, when 
possible, the underlying measure-theoretic ideas, while emphasizing the 
probabilistic way of thinking, which is likely to be quite novel to anyone 
studying this subject for the first time. 

The major field of application considered in the book is statistics (Chapter 
8). In addition, some of the problems suggest connections with the physical 
sciences. Chapters 1 to 5 and Chapter 8 will serve as the basis for a one-semester
or a two-quarter course covering both probability and statistics.
If probability alone is to be considered, Chapter 8 may be replaced by 
Chapter 6 and Chapter 7, as time permits. An asterisk before a section or 
a problem indicates material that I have normally omitted (without loss of 
continuity), either because it involves subject matter that many of the 
students have not been exposed to (for example, complex variables) or 
because it represents too concentrated a dosage of abstraction. 

A word to the instructor about notation. In the most popular terminology, 
$P\{X \le x\}$ is written for the probability that the random variable $X$ assumes
a value less than or equal to the number $x$. I tried this once in my class, and
I found that as the semester progressed, the capital X tended to become 
smaller in the students’ written work, and the small x larger. The following 
semester, I switched to the letter R for random variable, and this notation 
is used throughout the book. 

Fairly detailed solutions to some of the problems (and numerical answers
to others) are given at the end of the book. 

I hope that the book will provide an introduction to more advanced 
courses in probability and real analysis and that it makes the abstract ideas 
to be encountered later more meaningful. I also hope that nonmathematics 
majors who come in contact with probability theory in their own areas find 
the book useful. A brief list of references, suitable for future study, is given 
at the end of the book. 

I am grateful to the many students and colleagues who have influenced my 
own understanding of probability theory and thus contributed to this book. 

I also thank Mrs. Dee Keel for her superb typing, and the staff of Wiley 
for its continuing interest and assistance. 


Urbana, Illinois, 1969 Robert B. Ash 


Basic Concepts


1.1 INTRODUCTION 


The origin of probability theory lies in physical observations associated with 
games of chance. It was found that if an “unbiased” coin is tossed independently
$n$ times, where $n$ is very large, the relative frequency of heads, that is, the
ratio of the number of heads to the total number of tosses, is very likely to
be very close to 1/2. Similarly, if a card is drawn from a perfectly shuffled
deck and then is replaced, the deck is reshuffled, and the process is repeated 
over and over again, there is (in some sense) convergence of the relative 
frequency of spades to 1/4. 

In the card experiment there are 52 possible outcomes when a single card 
is drawn. There is no reason to favor one outcome over another (the principle 
of “insufficient reason” or of “least astonishment”), and so the early workers
in probability took as the probability of obtaining a spade the number of 
favorable outcomes divided by the total number of outcomes, that is, 13/52 
or 1/4. 

This so-called “classical definition” of probability (the probability of an
event is the number of outcomes favorable to the event, divided by the total
number of outcomes, where all outcomes are equally likely) is first of all
restrictive (it considers only experiments with a finite number of outcomes)
and, more seriously, circular (no matter how you look at it, “equally likely”
essentially means “equally probable,” and thus we are using the concept of
probability to define probability itself). Thus we cannot use this idea as the
basis of a mathematical theory of probability; however, the early probabilists
were not prevented from deriving many valid and useful results.

Similarly, an attempt at a frequency definition of probability will cause
trouble. If $S_n$ is the number of occurrences of an event in $n$ independent
performances of an experiment, we expect physically that the relative frequency
$S_n/n$ should converge to a limit; however, we cannot assert that the
limit exists in a mathematical sense. In the case of the tossing of an unbiased
coin, we expect that $S_n/n \to 1/2$, but a conceivable outcome of the process is
that the coin will keep coming up heads forever. In other words, it is possible
that $S_n/n \to 1$, or that $S_n/n \to$ any number between 0 and 1, or that $S_n/n$
has no limit at all.
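
The behavior of the relative frequency can be explored numerically. The following is a minimal sketch, assuming a pseudorandom generator as a stand-in for a physical coin; the function name `relative_frequency_of_heads` and the fixed seed are illustrative choices, not part of the text. A simulation of this kind can only suggest that the relative frequency settles near 1/2; it does not show that a limit exists.

```python
import random


def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses tosses of a fair coin and return S_n / n."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are reproducible
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses


# Print the relative frequency for increasingly long runs.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

For short runs the printed ratio can be far from 1/2; for long runs it is typically close, which is the physical observation the text describes.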

In this chapter we introduce the concepts that are to be used in the construction
of a mathematical theory of probability. The first ingredient we
need is a set $\Omega$, called the sample space, representing the collection of possible
outcomes of a random experiment. For example, if a coin is tossed once we
may take $\Omega = \{H, T\}$, where $H$ corresponds to a head and $T$ to a tail. If
the coin is tossed twice, this is a different experiment and we need a different
$\Omega$, say $\{HH, HT, TH, TT\}$; in this case one performance of the experiment
corresponds to two tosses of the coin.

If a single die is tossed, we may take $\Omega$ to consist of six points, say
$\Omega = \{1, 2, \ldots, 6\}$. However, another possible sample space consists of two
points, corresponding to the outcomes “$N$ is even” and “$N$ is odd,” where $N$
is the result of the toss. Thus different sample spaces can be associated with
the same experiment. The nature of the particular problem under consideration
will dictate which sample space is to be used. If we are interested, for
example, in whether or not $N > 3$ in a given performance of the experiment,
the second sample space, corresponding to “$N$ even” and “$N$ odd,” will not
be useful to us.

In general, the only physical requirement on $\Omega$ is that a given performance
of the experiment must produce a result corresponding to exactly one of the
points of $\Omega$. We have as yet no mathematical requirements on $\Omega$; it is simply a
set of points.

Next we come to the notion of event. An “event” associated with a random
experiment corresponds to a question about the experiment that has a yes or
no answer, and this in turn is associated with a subset of the sample space.
For example, if a coin is tossed twice and $\Omega = \{HH, HT, TH, TT\}$, “the
number of heads is $\le 1$” will be a condition that either occurs or does not
occur in a given performance of the experiment. That is, after the experiment
is performed, the question “Is the number of heads $\le 1$?” can be answered
yes or no. The subset of $\Omega$ corresponding to a “yes” answer is
$A = \{HT, TH, TT\}$; that is, if the outcome of the experiment is HT, TH, or TT, the answer
to the question “Is the number of heads $\le 1$?” will be “yes,” and if the
outcome is HH, the answer will be “no.” Similarly, the subset of $\Omega$ associated
with the “event” that the result of the first toss is the same as the result of the
second toss is $B = \{HH, TT\}$.

FIGURE 1.1.1 Coin-Tossing Experiment. $A = \{\text{number of heads} \le 1\}$, $B = \{\text{first toss} = \text{second toss}\}$.

Thus an event is defined as a subset of the sample space, that is, a collection 
of points of the sample space. (We shall qualify this in the next section.) 
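
The identification of events with subsets can be written out directly. The following is a minimal sketch using Python sets, with outcomes encoded as two-letter strings; the names `omega`, `A`, `B`, and `occurs` are illustrative assumptions.

```python
from itertools import product

# Sample space for two tosses: every ordered pair of H/T, written as a string.
omega = {"".join(pair) for pair in product("HT", repeat=2)}

# An event is the subset of outcomes for which a yes/no question is answered "yes".
A = {w for w in omega if w.count("H") <= 1}  # "Is the number of heads <= 1?"
B = {w for w in omega if w[0] == w[1]}       # "Is the first toss equal to the second?"


def occurs(event, outcome):
    """An event occurs exactly when the observed outcome is one of its points."""
    return outcome in event


print(sorted(A))  # ['HT', 'TH', 'TT']
print(sorted(B))  # ['HH', 'TT']
```

Describing an event by a condition (the set comprehension) and by listing its points (the printed list) are two views of the same subset.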

Events will be denoted by capital letters at the beginning of the English 
alphabet, such as A, B, C, and so on. An event may be characterized by listing 
all of its points, or equivalently by describing the conditions under which the 
event will occur. For example, in the coin-tossing experiment just considered, 
we write 


A = {the number of heads is less than or equal to 1} 


This expression is to be read as “$A$ is the set consisting of those outcomes
which satisfy the condition that the number of heads is less than or equal to
1,” or, more simply, “$A$ is the event that the number of heads is less than or
equal to 1.” The event $A$ consists of the points HT, TH, and TT; therefore
we write $A = \{HT, TH, TT\}$, which is to be read “$A$ is the event consisting
of the points HT, TH, and TT.” As another example, if $B$ is the event that
the result of the first toss is the same as the result of the second toss, we may
describe $B$ by writing $B = \{\text{first toss} = \text{second toss}\}$ or, equivalently,
$B = \{HH, TT\}$ (see Figure 1.1.1).

Each point belonging to an event $A$ is said to be favorable to $A$. The event
$A$ will occur in a given performance of the experiment if and only if the
outcome of the experiment corresponds to one of the points of $A$. The entire
sample space $\Omega$ is said to be the sure (or certain) event; it must occur on any
given performance of the experiment. On the other hand, the event consisting
of none of the points of the sample space, that is, the empty set $\emptyset$, is
called the impossible event; it can never occur in a given performance of the
experiment.


1.2 ALGEBRA OF EVENTS (BOOLEAN ALGEBRA) 


Before talking about the assignment of probabilities to events, we introduce
some operations by which new events are formed from old ones. These
operations correspond to the construction of compound sentences by use of
the connectives “or,” “and,” and “not.” Let $A$ and $B$ be events in the same
sample space. Define the union of $A$ and $B$ (denoted by $A \cup B$) as the set
consisting of those points belonging to either $A$ or $B$ or both. (Unless otherwise
specified, the word “or” will have, for us, the inclusive connotation.
In other words, the statement “$p$ or $q$” will always mean “$p$ or $q$ or both.”)
Define the intersection of $A$ and $B$, written $A \cap B$, as the set of points that
belong to both $A$ and $B$. Define the complement of $A$, written $A^c$, as the set of
points which do not belong to $A$.


▶ Example 1. Consider the experiment involving the toss of a single die,
with $N$ = the result; take a sample space with six points corresponding to
$N = 1, 2, 3, 4, 5, 6$. For convenience, label the points of the sample space
by the integers 1 through 6.

FIGURE 1.2.1 Venn Diagrams (union $A \cup B$, intersection $A \cap B$, complement).

Let $A = \{N \text{ is even}\}$ and $B = \{N \ge 3\}$. Then

\[
\begin{aligned}
A \cup B &= \{N \text{ is even or } N \ge 3\} = \{2, 3, 4, 5, 6\} \\
A \cap B &= \{N \text{ is even and } N \ge 3\} = \{4, 6\} \\
A^c &= \{N \text{ is not even}\} = \{1, 3, 5\} \\
B^c &= \{N \text{ is not} \ge 3\} = \{N < 3\} = \{1, 2\}
\end{aligned}
\]
◀


Schematic representations (called Venn diagrams) of unions, intersections, 
and complements are shown in Figure 1.2.1. 
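
The set operations of Example 1 map directly onto Python's built-in set operators. This is a minimal sketch; the helper `complement` and the variable names are illustrative assumptions.

```python
# Sample space for one toss of a die, labeled by the integers 1 through 6.
omega = set(range(1, 7))

A = {n for n in omega if n % 2 == 0}  # N is even
B = {n for n in omega if n >= 3}      # N >= 3


def complement(event):
    """Points of the sample space that do not belong to the event."""
    return omega - event


print(sorted(A | B))          # union: [2, 3, 4, 5, 6]
print(sorted(A & B))          # intersection: [4, 6]
print(sorted(complement(A)))  # [1, 3, 5]
print(sorted(complement(B)))  # [1, 2]
```

Note that the complement is always taken relative to the sample space, which is why `complement` needs `omega`.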

Define the union of $n$ events $A_1, A_2, \ldots, A_n$ (notation: $A_1 \cup \cdots \cup A_n$,
or $\bigcup_{i=1}^{n} A_i$) as the set consisting of those points which belong to at least one
of the events $A_1, A_2, \ldots, A_n$. Similarly define the union of an infinite sequence
of events $A_1, A_2, \ldots$ as the set of points belonging to at least one of the
events $A_1, A_2, \ldots$ (notation: $A_1 \cup A_2 \cup \cdots$, or $\bigcup_{i=1}^{\infty} A_i$).

Define the intersection of $n$ events $A_1, \ldots, A_n$ as the set of points belonging
to all of the events $A_1, \ldots, A_n$ (notation: $A_1 \cap A_2 \cap \cdots \cap A_n$, or $\bigcap_{i=1}^{n} A_i$).
Similarly define the intersection of an infinite sequence of events as the set of
points belonging to all the events in the sequence (notation: $A_1 \cap A_2 \cap \cdots$,
or $\bigcap_{i=1}^{\infty} A_i$). In the above example, with $A = \{N \text{ is even}\} = \{2, 4, 6\}$,
$B = \{N \ge 3\} = \{3, 4, 5, 6\}$, $C = \{N = 1 \text{ or } N = 5\} = \{1, 5\}$, we have

\[
\begin{aligned}
A \cup B \cup C &= \Omega, \qquad A \cap B \cap C = \emptyset \\
A \cup B^c \cup C &= \{2, 4, 6\} \cup \{1, 2\} \cup \{1, 5\} = \{1, 2, 4, 5, 6\} \\
(A \cup C) \cap [(A \cap B)^c] &= \{1, 2, 4, 5, 6\} \cap \{4, 6\}^c = \{1, 2, 5\}
\end{aligned}
\]


Two events $A$ and $B$ in a sample space are said to be mutually exclusive or disjoint
if $A$ and $B$ have no points in common, that is, if it is impossible that both $A$
and $B$ occur during the same performance of the experiment. In symbols,
$A$ and $B$ are mutually exclusive if $A \cap B = \emptyset$. In general the events
$A_1, A_2, \ldots, A_n$ are said to be mutually exclusive if no two of the events have a
point in common; that is, no more than one of the events can occur during
the same performance of the experiment. Symbolically, this condition may be
written

\[ A_i \cap A_j = \emptyset \quad \text{for } i \ne j \]

Similarly, infinitely many events $A_1, A_2, \ldots$ are said to be mutually exclusive if
$A_i \cap A_j = \emptyset$ for $i \ne j$.

FIGURE 1.2.2 $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$.

In some ways the algebra of events is similar to the algebra of real numbers,
with union corresponding to addition and intersection to multiplication. For
example, the commutative and associative properties hold.

\[
\begin{aligned}
A \cup B &= B \cup A, & A \cup (B \cup C) &= (A \cup B) \cup C \\
A \cap B &= B \cap A, & A \cap (B \cap C) &= (A \cap B) \cap C
\end{aligned}
\tag{1.2.1}
\]

Furthermore, we can prove that for events $A$, $B$, and $C$ in the same sample
space we have

\[ A \cap (B \cup C) = (A \cap B) \cup (A \cap C) \tag{1.2.2} \]

There are several ways to establish this; for example, we may verify that the
sets of both the left and right sides of the equality above are represented by
the area in the Venn diagram of Figure 1.2.2.

Another approach is to use the definitions of union and intersection to
show that the sets in question have precisely the same members; that is, we
show that any point which belongs to the set on the left necessarily belongs to
the set on the right, and conversely. To do this, we proceed as follows.


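\[
\begin{aligned}
x \in A \cap (B \cup C) &\Rightarrow x \in A \text{ and } x \in B \cup C \\
&\Rightarrow x \in A \text{ and } (x \in B \text{ or } x \in C)
\end{aligned}
\]

(The symbol $\Rightarrow$ means “implies,” and $\Leftrightarrow$ means “implies and is implied by.”)


CASE 1. $x \in B$. Then $x \in A$ and $x \in B$, so $x \in A \cap B$, so $x \in (A \cap B) \cup (A \cap C)$.


CASE 2. $x \in C$. Then $x \in A$ and $x \in C$, so $x \in A \cap C$, so $x \in (A \cap B) \cup (A \cap C)$.


Thus $x \in A \cap (B \cup C) \Rightarrow x \in (A \cap B) \cup (A \cap C)$; that is,
$A \cap (B \cup C) \subset (A \cap B) \cup (A \cap C)$. (The symbol $\subset$ is read “is a subset of”;
we say that $A_1 \subset A_2$ provided that $x \in A_1 \Rightarrow x \in A_2$; see Figure 1.2.3. Notice
that, according to this definition, a set $A$ is a subset of itself: $A \subset A$.)


Conversely: Let $x \in (A \cap B) \cup (A \cap C)$. Then $x \in A \cap B$ or $x \in A \cap C$.


CASE 1. $x \in A \cap B$. Then $x \in B$, so $x \in B \cup C$, so $x \in A \cap (B \cup C)$.
CASE 2. $x \in A \cap C$. Then $x \in C$, so $x \in B \cup C$, so $x \in A \cap (B \cup C)$.


Thus $(A \cap B) \cup (A \cap C) \subset A \cap (B \cup C)$; hence

\[ A \cap (B \cup C) = (A \cap B) \cup (A \cap C) \]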
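FIGURE 1.2.3 $A_1 \subset A_2$.

As another example we show that

\[ (A_1 \cup A_2 \cup \cdots \cup A_n)^c = A_1^c \cap A_2^c \cap \cdots \cap A_n^c \tag{1.2.3} \]

The steps are as follows.

\[
\begin{aligned}
x \in (A_1 \cup \cdots \cup A_n)^c &\Leftrightarrow x \notin A_1 \cup \cdots \cup A_n \\
&\Leftrightarrow \text{it is not the case that } x \text{ belongs to at least one of the } A_i \\
&\Leftrightarrow x \in \text{none of the } A_i \\
&\Leftrightarrow x \in A_i^c \text{ for all } i \\
&\Leftrightarrow x \in A_1^c \cap \cdots \cap A_n^c
\end{aligned}
\]

An identical argument shows that

\[ \left( \bigcup_{i=1}^{\infty} A_i \right)^c = \bigcap_{i=1}^{\infty} A_i^c \tag{1.2.4} \]

and similarly

\[ \left( \bigcap_{i=1}^{n} A_i \right)^c = \bigcup_{i=1}^{n} A_i^c; \quad \text{i.e., } (A_1 \cap \cdots \cap A_n)^c = A_1^c \cup \cdots \cup A_n^c \tag{1.2.5} \]

Also

\[ \left( \bigcap_{i=1}^{\infty} A_i \right)^c = \bigcup_{i=1}^{\infty} A_i^c \tag{1.2.6} \]

The identities (1.2.3)-(1.2.6) are called the DeMorgan laws.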
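
The finite DeMorgan laws (1.2.3) and (1.2.5) can be spot-checked on a concrete family of events. The sketch below reuses the die events $A$, $B$, $C$ from the earlier example; the helper names `union_all` and `intersection_all` are assumptions introduced for illustration, and a check on one family is of course not a proof.

```python
from functools import reduce

omega = frozenset(range(1, 7))


def complement(event):
    return omega - event


def union_all(events):
    """Points belonging to at least one of the events."""
    return reduce(frozenset.union, events, frozenset())


def intersection_all(events):
    """Points belonging to all of the events."""
    return reduce(frozenset.intersection, events, omega)


# A = {N even}, B = {N >= 3}, C = {N = 1 or N = 5}
events = [frozenset({2, 4, 6}), frozenset({3, 4, 5, 6}), frozenset({1, 5})]

# (1.2.3): complement of the union equals the intersection of the complements.
lhs_123 = complement(union_all(events))
rhs_123 = intersection_all([complement(e) for e in events])

# (1.2.5): complement of the intersection equals the union of the complements.
lhs_125 = complement(intersection_all(events))
rhs_125 = union_all([complement(e) for e in events])

print(lhs_123 == rhs_123, lhs_125 == rhs_125)  # True True
```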
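In many ways the algebra of events differs from the algebra of real numbers,
as some of the identities below indicate.

\[
\begin{aligned}
A \cup A &= A, & A \cup A^c &= \Omega \\
A \cap A &= A, & A \cap A^c &= \emptyset \\
A \cap \Omega &= A, & A \cup \emptyset &= A \\
A \cup \Omega &= \Omega, & A \cap \emptyset &= \emptyset
\end{aligned}
\tag{1.2.7}
\]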
Another method of verifying relations among events involves algebraic
manipulation, using the identities already derived. Four examples are given
below; in working out the identities, it may be helpful to write $A \cup B$ as
$A + B$ and $A \cap B$ as $AB$.


1. $A \cup (A \cap B) = A$ (1.2.8)

PROOF.
\[ A + AB = A\Omega + AB = A(\Omega + B) = A\Omega = A \]


2. $(A \cup B) \cap (A \cup C) = A \cup (B \cap C)$ (1.2.9)

PROOF.
\[
\begin{aligned}
(A + B)(A + C) &= (A + B)A + (A + B)C \\
&= AA + AB + AC + BC \quad (\text{note } AB = BA) \\
&= A(\Omega + B + C) + BC \\
&= A\Omega + BC \\
&= A + BC
\end{aligned}
\]

3. $A \cup [(A \cap B)^c] = \Omega$ (1.2.10)

PROOF.
\[ A + (AB)^c = A + A^c + B^c = \Omega + B^c = \Omega \]

4. $(A \cap B^c) \cup (A \cap B) \cup (A^c \cap B) = A \cup B$ (1.2.11)

PROOF.
\[
\begin{aligned}
AB^c + AB + A^cB &= AB^c + AB + AB + A^cB \quad [\text{see (1.2.7)}] \\
&= A(B^c + B) + (A + A^c)B \\
&= A\Omega + \Omega B \\
&= A + B
\end{aligned}
\]

(see Figure 1.2.4).
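
As a sanity check on the four identities (1.2.8) through (1.2.11), they can be tested by brute force over every choice of $A$, $B$, $C$ among the subsets of a small sample space. This is a sketch under the assumption of a three-point $\Omega$ chosen only for illustration; it complements the algebraic proofs rather than replacing them.

```python
from itertools import combinations, product

omega = frozenset({1, 2, 3})


def all_subsets(s):
    """Every subset of s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def comp(event):
    return omega - event


def identities_hold(A, B, C):
    return (
        A | (A & B) == A                                          # (1.2.8)
        and (A | B) & (A | C) == A | (B & C)                      # (1.2.9)
        and A | comp(A & B) == omega                              # (1.2.10)
        and (A & comp(B)) | (A & B) | (comp(A) & B) == A | B      # (1.2.11)
    )


subsets = all_subsets(omega)
all_ok = all(identities_hold(A, B, C)
             for A, B, C in product(subsets, repeat=3))
print(len(subsets), all_ok)  # 8 True
```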
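As another example, let $\Omega$ be the set of nonnegative real numbers. Let

\[ A_n = \left[0, 1 - \tfrac{1}{n}\right] = \left\{ x \in \Omega : 0 \le x \le 1 - \tfrac{1}{n} \right\}, \qquad n = 1, 2, \ldots \]

(This will be another common way of describing an event. It is to be read:
“$A_n$ is the set consisting of those points $x$ in $\Omega$ such that $0 \le x \le 1 - 1/n$.”
If there is no confusion about what space $\Omega$ we are considering, we shall
simply write $A_n = \{x : 0 \le x \le 1 - 1/n\}$.) Then

\[ \bigcup_{n=1}^{\infty} A_n = [0, 1), \qquad \bigcap_{n=1}^{\infty} A_n = \{0\} \]

FIGURE 1.2.4 Venn Diagram Illustrating $(A \cap B^c) \cup (A \cap B) \cup (A^c \cap B) = A \cup B$.

As an illustration of the DeMorgan laws,

\[ \left( \bigcup_{n=1}^{\infty} A_n \right)^c = [0, 1)^c = [1, \infty) = \{x : x \ge 1\} \]

\[ \bigcap_{n=1}^{\infty} A_n^c = \bigcap_{n=1}^{\infty} \left(1 - \tfrac{1}{n}, \infty\right) = [1, \infty) \]

(Notice that $x > 1 - 1/n$ for all $n = 1, 2, \ldots \Leftrightarrow x \ge 1$.) Also

\[ \left( \bigcap_{n=1}^{\infty} A_n \right)^c = \{0\}^c = (0, \infty) = \{x : x > 0\} \]

\[ \bigcup_{n=1}^{\infty} A_n^c = \bigcup_{n=1}^{\infty} \left(1 - \tfrac{1}{n}, \infty\right) = (0, \infty) \]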
PROBLEMS


1. An experiment involves choosing an integer $N$ between 0 and 9 (the sample space
consists of the integers from 0 to 9, inclusive). Let $A = \{N < 5\}$, $B = \{3 < N < 7\}$,
$C = \{N \text{ is even and } N > 0\}$. List the points that belong to the following
events.

\[ A \cap B \cap C, \quad A \cup (B \cap C), \quad (A \cup B) \cap C^c, \quad (A \cap B) \cap [(A \cup C)^c] \]


2. Let $A$, $B$, and $C$ be arbitrary events in the same sample space. Let $D_1$ be the
event that at least two of the events $A$, $B$, $C$ occur; that is, $D_1$ is the set of points
common to at least two of the sets $A$, $B$, $C$.
Let $D_2$ = {exactly two of the events $A$, $B$, $C$ occur}

$D_3$ = {at least one of the events $A$, $B$, $C$ occurs}

$D_4$ = {exactly one of the events $A$, $B$, $C$ occurs}

$D_5$ = {not more than two of the events $A$, $B$, $C$ occur}
Each of the events $D_1$ through $D_5$ can be expressed in terms of $A$, $B$, and $C$ by
using unions, intersections, and complements. For example, $D_3 = A \cup B \cup C$.
Find suitable expressions for $D_1$, $D_2$, $D_4$, and $D_5$.


3. A public opinion poll (circa 1850) consisted of the following three questions:

(a) Are you a registered Whig?
(b) Do you approve of President Fillmore’s performance in office?
(c) Do you favor the Electoral College system?
A group of 1000 people is polled. Assume that the answer to each question must
be either “yes” or “no.” It is found that:

550 people answer “yes” to the third question and 450 answer “no.”

325 people answer “yes” exactly twice; that is, their responses contain two
“yeses” and one “no.”

100 people answer “yes” to all three questions.

125 registered Whigs approve of Fillmore’s performance.

How many of those who favor the Electoral College system do not approve
of Fillmore’s performance, and in addition are not registered Whigs? HINT:
Draw a Venn diagram.


4. If $A$ and $B$ are events in a sample space, define $A - B$ as the set of points which
belong to $A$ but not to $B$; that is, $A - B = A \cap B^c$. Establish the following.
(a) $A \cap (B - C) = (A \cap B) - (A \cap C)$

(b) $A - (B \cup C) = (A - B) - C$
Is it true that $(A - B) \cup C = (A \cup C) - B$?
5. Let $\Omega$ be the reals. Establish the following.

\[ (a, b) = \bigcup_{n=1}^{\infty} \left(a, b - \tfrac{1}{n}\right] \]

\[ [a, b] = \bigcap_{n=1}^{\infty} \left[a, b + \tfrac{1}{n}\right) \]

6. If $A$ and $B$ are disjoint events, are $A^c$ and $B^c$ disjoint? Are $A \cap C$ and $B \cap C$
disjoint? What about $A \cup C$ and $B \cup C$?

7. If $A_n \subset A_{n-1} \subset \cdots \subset A_1$, show that $\bigcap_{i=1}^{n} A_i = A_n$, $\bigcup_{i=1}^{n} A_i = A_1$.

8. Suppose that $A_1, A_2, \ldots$ is a sequence of subsets of $\Omega$, and we know that for
each $n$, $\bigcap_{i=1}^{n} A_i$ is not empty. Is it true that $\bigcap_{i=1}^{\infty} A_i$ is not empty? (A related
question about real numbers: if, for each $n$, we have $\sum_{i=1}^{n} a_i < b$, is it true that
$\sum_{i=1}^{\infty} a_i < b$?)


9. If $A, B_1, B_2, \ldots$ are arbitrary events, show that

\[ A \cap \left( \bigcup_i B_i \right) = \bigcup_i (A \cap B_i) \]

This is the distributive law with infinitely many factors.


We now consider the assignment of probabilities to events. A technical
complication arises here. It may not always be possible to regard all subsets
of $\Omega$ as events. We may discard or fail to measure some of the information in
the outcome corresponding to the point $\omega \in \Omega$, so that for a given subset
$A$ of $\Omega$, it may not be possible to give a yes or no answer to the question
“Is $\omega \in A$?” For example, if the experiment involves tossing a coin five times,
we may record the results of only the first three tosses, so that
$A = \{\text{at least four heads}\}$ will not be “measurable”; that is, membership of $\omega \in A$ cannot
be determined from the given information about $\omega$.

In a given problem there will be a particular class of subsets of $\Omega$ called the
“class of events.” For reasons of mathematical consistency, we require that
the event class $\mathcal{F}$ form a sigma field, which is a collection of subsets of $\Omega$
satisfying the following three requirements.


\[ \Omega \in \mathcal{F} \tag{1.3.1} \]

\[ A_1, A_2, \ldots \in \mathcal{F} \quad \text{implies} \quad \bigcup_{n=1}^{\infty} A_n \in \mathcal{F} \tag{1.3.2} \]

That is, $\mathcal{F}$ is closed under finite or countable union.

\[ A \in \mathcal{F} \quad \text{implies} \quad A^c \in \mathcal{F} \tag{1.3.3} \]

That is, $\mathcal{F}$ is closed under complementation.

Notice that if $A_1, A_2, \ldots \in \mathcal{F}$, then $A_1^c, A_2^c, \ldots \in \mathcal{F}$ by (1.3.3); hence
$\bigcup_{n=1}^{\infty} A_n^c \in \mathcal{F}$ by (1.3.2). By the DeMorgan laws, $\bigcap_{n=1}^{\infty} A_n = \left( \bigcup_{n=1}^{\infty} A_n^c \right)^c$;
hence, by (1.3.3), $\bigcap_{n=1}^{\infty} A_n \in \mathcal{F}$. Thus $\mathcal{F}$ is closed under finite or countable
intersection. Also, by (1.3.1) and (1.3.3), the empty set $\emptyset$ belongs to $\mathcal{F}$.

Thus, for example, if the question “Did $A_n$ occur?” has a definite answer
for $n = 1, 2, \ldots$, so do the questions “Did at least one of the $A_n$ occur?”
and “Did all the $A_n$ occur?”

Note also that if we apply the algebraic operations of Section 1.2 to sets in
$\mathcal{F}$, the new sets we obtain still belong to $\mathcal{F}$.

In many cases we shall be able to take $\mathcal{F}$ = the collection of all subsets
of $\Omega$, so that every subset of $\Omega$ is an event. Problems in which $\mathcal{F}$ cannot be
chosen in this way generally arise in uncountably infinite sample spaces;
for example, $\Omega$ = the reals. We shall return to this subject in Chapter 2.
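
For a finite sample space, the three requirements (1.3.1) to (1.3.3) can be checked mechanically. The following is a minimal sketch; the function name `is_sigma_field` and the two sample collections are hypothetical illustrations. It relies on the observation that a finite collection contains only finitely many distinct sets, so closure under pairwise union already gives closure under countable union.

```python
from itertools import combinations


def is_sigma_field(omega, collection):
    """Check (1.3.1)-(1.3.3) for a collection of subsets of a finite omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if omega not in sets:                           # (1.3.1)
        return False
    if any(omega - s not in sets for s in sets):    # (1.3.3) complements
        return False
    # (1.3.2): for a finite collection, pairwise unions suffice.
    return all(a | b in sets for a, b in combinations(sets, 2))


omega = {1, 2, 3, 4}
F_small = [set(), {1, 2}, {3, 4}, omega]  # a sigma field smaller than all subsets
not_a_field = [set(), {1}, omega]         # missing the complement {2, 3, 4}

print(is_sigma_field(omega, F_small))      # True
print(is_sigma_field(omega, not_a_field))  # False
```

The collection `F_small` models an observer who can only tell whether the outcome lies in {1, 2} or in {3, 4}, in the spirit of the partially recorded coin tosses described above.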

We are now ready to talk about the assignment of probabilities to events.
If $A \in \mathcal{F}$, the probability $P(A)$ should somehow reflect the long-run relative
frequency of $A$ in a large number of independent repetitions of the experiment.
Thus $P(A)$ should be a number between 0 and 1, and $P(\Omega)$ should be 1.

Now if $A$ and $B$ are disjoint events, the number of occurrences of $A \cup B$
in $n$ performances of the experiment is obtained by adding the number of
occurrences of $A$ to the number of occurrences of $B$. Thus we should have

\[ P(A \cup B) = P(A) + P(B) \quad \text{if } A \text{ and } B \text{ are disjoint} \]

and, similarly,

\[ P(A_1 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i) \quad \text{if } A_1, \ldots, A_n \text{ are disjoint} \]

For mathematical convenience we require that

\[ P\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} P(A_n) \]

when we have a countably infinite family of disjoint events $A_1, A_2, \ldots$.
The assumption of countable rather than simply finite additivity has not
been convincingly justified physically or philosophically; however, it leads
to a much richer mathematical theory.

A function that assigns a number $P(A)$ to each set $A$ in the sigma field $\mathcal{F}$
is called a probability measure on $\mathcal{F}$, provided that the following conditions
are satisfied.

\[ P(A) \ge 0 \quad \text{for every } A \in \mathcal{F} \tag{1.3.4} \]

\[ P(\Omega) = 1 \tag{1.3.5} \]

If $A_1, A_2, \ldots$ are disjoint sets in $\mathcal{F}$, then

\[ P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots \tag{1.3.6} \]


We may now give the underlying mathematical framework for probability
theory.


DEFINITION. A probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a set,
$\mathcal{F}$ a sigma field of subsets of $\Omega$, and $P$ a probability measure on $\mathcal{F}$.


We shall not, at this point, embark on a general study of probability
measures. However, we shall establish four facts from the definition.
(All sets in the arguments to follow are assumed to belong to $\mathcal{F}$.)


1. $P(\emptyset) = 0$ (1.3.7)


PROOF. $A \cup \emptyset = A$; hence $P(A \cup \emptyset) = P(A)$. But $A$ and $\emptyset$ are disjoint
and so $P(A \cup \emptyset) = P(A) + P(\emptyset)$. Thus $P(A) = P(A) + P(\emptyset)$; consequently
$P(\emptyset) = 0$.


2. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (1.3.8)

PROOF. $A = (A \cap B) \cup (A \cap B^c)$, and these sets are disjoint (see Figure
1.2.4). Thus $P(A) = P(A \cap B) + P(A \cap B^c)$. Similarly $P(B) = P(A \cap B) + P(A^c \cap B)$.
Thus $P(A) + P(B) - P(A \cap B) = P(A \cap B) + P(A \cap B^c) + P(A^c \cap B) = P(A \cup B)$.
Intuitively, if we add the outcomes in $A$ to those in
$B$, we have counted those in $A \cap B$ twice; subtracting the outcomes in
$A \cap B$ yields the outcomes in $A \cup B$.
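
Facts (1.3.7) and (1.3.8) can be checked on a concrete probability space. The sketch below assumes a fair die (each point given probability 1/6), which is an illustrative choice and not something the text prescribes; exact rational arithmetic via `fractions.Fraction` avoids floating-point rounding.

```python
from fractions import Fraction

omega = set(range(1, 7))
p = {w: Fraction(1, 6) for w in omega}  # assumption: a fair die


def P(event):
    """Probability of an event: the sum of the probabilities of its points."""
    return sum((p[w] for w in event), Fraction(0))


A = {2, 4, 6}     # N is even
B = {3, 4, 5, 6}  # N >= 3

print(P(A | B))                  # 5/6
print(P(A) + P(B) - P(A & B))    # 5/6
```

Here $P(A) + P(B) = 7/6$ exceeds 1 because the points 4 and 6 are counted twice; subtracting $P(A \cap B) = 1/3$ corrects the double count.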
3. If $B \subset A$, then $P(B) \le P(A)$; in fact,

\[ P(A - B) = P(A) - P(B) \tag{1.3.9} \]

where $A - B$ is the set of points that belong to $A$ but not to $B$.


PROOF. $P(A) = P(B) + P(A - B)$, since $B \subset A$ (see Figure 1.3.1), and
the result follows because $P(A - B) \ge 0$. Intuitively, if the occurrence of $B$
always implies the occurrence of $A$, $A$ must occur at least as often as $B$ in
any sequence of performances of the experiment.


4. $P(A_1 \cup A_2 \cup \cdots) \le P(A_1) + P(A_2) + \cdots$ (1.3.10)


That is, the probability that at least one of a finite or countably infinite
collection of events will occur is less than or equal to the sum of the probabilities;
note that, for the case of two events, this follows from
$P(A \cup B) = P(A) + P(B) - P(A \cap B) \le P(A) + P(B)$.


PROOF. We make use of the fact that any union may be written as a
disjoint union, as follows.

\[
A_1 \cup A_2 \cup \cdots = A_1 \cup (A_1^c \cap A_2) \cup (A_1^c \cap A_2^c \cap A_3) \cup \cdots \cup (A_1^c \cap A_2^c \cap \cdots \cap A_{n-1}^c \cap A_n) \cup \cdots \tag{1.3.11}
\]

To see this, observe that if $x$ belongs to the set on the right then
$x \in A_1^c \cap \cdots \cap A_{n-1}^c \cap A_n$ for some $n$; hence $x \in A_n$. Thus $x$ belongs to the set on the
left. Conversely, if $x$ belongs to the set on the left, then $x \in A_n$ for some $n$.
Let $n_0$ be the smallest such $n$. Then $x \in A_1^c \cap \cdots \cap A_{n_0 - 1}^c \cap A_{n_0}$, and so $x$
belongs to the set on the right. Thus

\[
P\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} P(A_1^c \cap \cdots \cap A_{n-1}^c \cap A_n) \le \sum_{n=1}^{\infty} P(A_n)
\]

using (1.3.9); notice that

\[ A_1^c \cap \cdots \cap A_{n-1}^c \cap A_n \subset A_n \]
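
The disjoint decomposition used in the proof can be carried out explicitly for a finite list of events. This is a sketch under the assumption of a fair die and three illustrative events; the helper name `disjoint_pieces` is hypothetical. Each piece consists of the points of $A_n$ that belong to none of the earlier events.

```python
from fractions import Fraction

omega = set(range(1, 7))


def P(event):
    # Assumption: a fair die, so each point has probability 1/6.
    return Fraction(len(event), len(omega))


def disjoint_pieces(events):
    """Return the sets A_1^c ∩ ... ∩ A_{n-1}^c ∩ A_n for n = 1, 2, ..."""
    pieces, seen = [], set()
    for a in events:
        pieces.append(set(a) - seen)  # points of A_n not in any earlier event
        seen |= set(a)
    return pieces


events = [{2, 4, 6}, {3, 4, 5, 6}, {1, 5}]
pieces = disjoint_pieces(events)
union = set().union(*events)

print(pieces)                                      # [{2, 4, 6}, {3, 5}, {1}]
print(P(union), sum(P(b) for b in pieces))         # 1 1
print(P(union) <= sum(P(a) for a in events))       # True
```

The pieces are disjoint and have the same union as the original events, so their probabilities add up to the probability of the union, while the sum over the original events (here 3/2) can only be larger.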


REMARKS. The basic difficulty with the classical and frequency definitions 
of probability is that their approach is to try somehow to prove 
mathematically that, for example, the probability of picking a heart 
from a perfectly shuffled deck is 1/4, or that the probability of an 
unbiased coin coming up heads is 1/2. This cannot be done. All we 
can say is that if a card is picked at random and then replaced, and the 
process is repeated over and over again, the result that the ratio of
hearts to total number of drawings will be close to 1/4 is in accord 
with our intuition and our physical experience. For this reason we 
should assign a probability 1/4 to the event of obtaining a heart, and 
similarly we should assign a probability 1/52 to each possible outcome 
of the experiment. The only reason for doing this is that the con- 
sequences agree with our experience. If you decide that some mysterious 
factor caused the ace of spades to be more likely than any other card, 
you could incorporate this factor by assigning a higher probability to 
the ace of spades. The mathematical development of the theory would 
not be affected; however, the conclusions you might draw from
this assumption would be at variance with experimental results. 

One can never really use mathematics to prove a specific physical 
fact. For example, we cannot prove mathematically that there is a 
physical quantity called “force.” What we can do is postulate a
mathematical entity called “force” that satisfies a certain differential 
equation. We can build up a collection of mathematical results that, 
when interpreted properly, provide a reasonable description of certain 
physical phenomena (reasonable until another mathematical theory is 
constructed that provides a better description). Similarly, in probability 
theory we are faced with situations in which our intuition or some 
physical experiments we have carried out suggest certain results. 
Intuition and experience lead us to an assignment of probabilities to 
events. As far as the mathematics is concerned, any assignment of 
probabilities will do, subject to the rules of mathematical con- 
sistency. However, our hope is to develop mathematical results that, 
when interpreted and related to physical experience, will help to 
make precise such notions as “the ratio of the number of heads to the
total number of observations in a very large number of independent 
tosses of an unbiased coin is very likely to be very close to 1/2.” 

We emphasize that the insights gained by the early workers in prob- 
ability are not to be discarded, but instead cast in a more precise 
form. 


PROBLEMS 


1. Write down some examples of sigma fields other than the collection of all
subsets of a given set $\Omega$.


2. Give an example to show that $P(A - B)$ need not equal $P(A) - P(B)$ if $B$ is not
a subset of $A$.


1.4 COMBINATORIAL PROBLEMS


We consider a class of problems in which the assignment of probabilities can 
be made in a natural way. 

Let $\Omega$ be a finite or countably infinite set, and let $\mathcal{F}$ consist of all subsets of
$\Omega$.

For each point $\omega_i \in \Omega$, $i = 1, 2, \ldots$, assign a nonnegative number $p_i$,
with $\sum_i p_i = 1$. If $A$ is any subset of $\Omega$, let $P(A) = \sum_{\omega_i \in A} p_i$. Then it may
be verified that $P$ is a probability measure; $P\{\omega_i\} = p_i$, and the probability of
any event $A$ is found by adding the probabilities of the points of $A$. An
$(\Omega, \mathcal{F}, P)$ of this type is called a discrete probability space.


▶ Example 1. Throw a (biased) coin twice (see Figure 1.4.1).

FIGURE 1.4.1 Coin-Tossing Problem. $A_1 = \{HH\}$, $A_2 = \{HT\}$, $A_3 = \{TH\}$, $A_4 = \{TT\}$. Assign $P(A_1) = .36$, $P(A_2) = P(A_3) = .24$, $P(A_4) = .16$.

Let $E_1 = \{\text{at least one head}\}$. Then

\[ E_1 = A_1 \cup A_2 \cup A_3 \]

Hence

\[ P(E_1) = P(A_1) + P(A_2) + P(A_3) = .36 + .24 + .24 = .84 \]

Let $E_2 = \{\text{tail on first toss}\}$; then

\[ E_2 = A_3 \cup A_4 \]

and

\[ P(E_2) = P(A_3) + P(A_4) = .4 \]
◀
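
The biased-coin example is a small discrete probability space and can be reproduced in a few lines. This is a minimal sketch; the dictionary `p` holds the point probabilities from Figure 1.4.1, and the names `P`, `E1`, and `E2` are illustrative.

```python
from fractions import Fraction

# Point probabilities from Figure 1.4.1, kept exact with Fraction.
p = {
    "HH": Fraction(36, 100),
    "HT": Fraction(24, 100),
    "TH": Fraction(24, 100),
    "TT": Fraction(16, 100),
}


def P(event):
    """P(A) is the sum of p_i over the points of A."""
    return sum((p[w] for w in event), Fraction(0))


E1 = {w for w in p if "H" in w}     # at least one head
E2 = {w for w in p if w[0] == "T"}  # tail on first toss

print(float(P(E1)))  # 0.84
print(float(P(E2)))  # 0.4
```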
In the special case when $\Omega = \{\omega_1, \ldots, \omega_n\}$ and $p_i = 1/n$, $i = 1, 2, \ldots, n$,
we have

\[ P(A) = \frac{\text{number of points of } A}{\text{total number of points in } \Omega} = \frac{\text{favorable outcomes}}{\text{total outcomes}} \]

corresponding to the classical definition of probability.


Thus, in this case, finding P(A) simply involves counting the number of 
outcomes favorable to A. When n is large, counting by hand may not be 
feasible; combinatorial analysis is simply a method of counting that can often 
be used to avoid writing down the entire list of favorable outcomes. 

There is only one basic idea in combinatorial analysis, and that is the
following. Suppose that a symbol is selected from the set $\{a_1, \ldots, a_n\}$;
if $a_i$ is chosen, a symbol is selected from the set $\{b_{i1}, \ldots, b_{im}\}$. Each pair of
selections $(a_i, b_{ij})$ is assumed to determine a “result” $f(i, j)$. If all results are
distinct, the number of possible results is $nm$, since there is a one-to-one
correspondence between results and pairs of integers $(i, j)$, $i = 1, \ldots, n$,
$j = 1, \ldots, m$.

If, after the symbol $b_{ij}$ is chosen, a symbol is selected from the set
$\{c_{ij1}, c_{ij2}, \ldots, c_{ijp}\}$, and each triple $(a_i, b_{ij}, c_{ijk})$ determines a distinct result
$f(i, j, k)$, the number of possible results is $nmp$. Analogous statements may be
made for any finite sequence of selections.
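
The basic counting idea can be made concrete by enumerating the pairs. In this sketch the symbols and the helper names `count_results` and `second_for` are hypothetical; the point is that the set of second choices may depend on the first choice, yet the count is still $nm$ as long as each first choice offers $m$ options and all results are distinct.

```python
def count_results(first_choices, second_choices_for):
    """List every pair (a, b), where b is drawn from a set that may depend on a."""
    return [(a, b) for a in first_choices for b in second_choices_for(a)]


# Hypothetical illustration: n = 3 first symbols, each followed by m = 2 options.
first = ["a1", "a2", "a3"]


def second_for(a):
    return [a + "-b1", a + "-b2"]


results = count_results(first, second_for)
print(len(results))  # 6
```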

Certain standard selections occur frequently, and it is convenient to classify 
them. 

Let $a_1, \ldots, a_n$ be distinct symbols.


Ordered Samples of Size r, with Replacement


The number of ordered sequences $(a_{i_1}, \ldots, a_{i_r})$, where the $a_{i_k}$ belong to
$\{a_1, \ldots, a_n\}$, is $n \times n \times \cdots \times n$ ($r$ times), or

\[ n^r \tag{1.4.1} \]

(The term “with replacement” refers to the fact that if the symbol $a_{i_k}$ is
selected at step $k$ it may be selected again at any future time.)

For example, the number of possible outcomes if three dice are thrown is
$6 \times 6 \times 6 = 216$.


Ordered Samples of Size r, without Replacement 


The number of ordered sequences $(a_{i_1}, \ldots, a_{i_r})$, where the $a_{i_k}$
belong to $\{a_1, \ldots, a_n\}$, but repetition is not allowed (i.e., no $a_i$ can
appear more than once in the sequence), is

$$n(n - 1) \cdots (n - r + 1) = \frac{n!}{(n - r)!}, \qquad r = 1, 2, \ldots, n \tag{1.4.2}$$

(The first symbol may be chosen in $n$ ways, and the second in $n - 1$ ways,
since the first symbol may not be used again, and so on.) The above number
is sometimes called the number of permutations of $r$ objects out of $n$, written
$(n)_r$.

For example, the number of 3-digit numbers that can be formed from 
1,2,...,9, if no digit can be repeated, is 9(8)(7) = 504. 
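
Formulas (1.4.1) and (1.4.2) can be checked by brute-force enumeration when n and r are small. The following Python sketch is an illustration only and is not part of the original text; the function names are hypothetical.

```python
from itertools import permutations, product
from math import factorial


def count_ordered_with_replacement(n, r):
    """Count sequences of length r over n symbols, repetition allowed."""
    return sum(1 for _ in product(range(n), repeat=r))


def count_ordered_without_replacement(n, r):
    """Count sequences of length r over n symbols, no symbol repeated."""
    return sum(1 for _ in permutations(range(n), r))


# Three dice: ordered samples of size 3 from 6 symbols, with replacement.
dice_outcomes = count_ordered_with_replacement(6, 3)

# 3-digit numbers formed from the digits 1..9 with no repeated digit.
three_digit_numbers = count_ordered_without_replacement(9, 3)

print(dice_outcomes, 6 ** 3)
print(three_digit_numbers, factorial(9) // factorial(9 - 3))
```

Enumeration is feasible only for small cases; its purpose here is to confirm that the closed forms agree with a direct count.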


Unordered Samples of Size r, without Replacement 


The number of unordered sets $\{a_{i_1}, \ldots, a_{i_r}\}$, where the $a_{i_k}$,
$k = 1, \ldots, r$, are distinct elements of $\{a_1, \ldots, a_n\}$ (i.e., the number
of ways of selecting $r$ distinct objects out of $n$), if order does not count, is

$$\binom{n}{r} = \frac{n!}{r!\,(n - r)!} \tag{1.4.3}$$


To see this, consider the following process.

(a) Select $r$ distinct objects out of $n$ without regard to order; this can be
done in $\binom{n}{r}$ ways, where $\binom{n}{r}$ is to be determined.

(b) For each set selected in (a), say $\{a_{i_1}, \ldots, a_{i_r}\}$, select an
ordering of $a_{i_1}, \ldots, a_{i_r}$. This can be done in $(r)_r = r!$ ways (see
Figure 1.4.2 for $n = 3$, $r = 2$).

The result of performing (a) and (b) is a permutation of $r$ objects out of $n$;
hence

$$\binom{n}{r}\, r! = (n)_r = \frac{n!}{(n - r)!} \qquad \text{or} \qquad \binom{n}{r} = \frac{n!}{r!\,(n - r)!}$$

We define $\binom{n}{0}$ to be $n!/0!\,n! = 1$, to make the formula for $\binom{n}{r}$
valid for $r = 0, 1, \ldots, n$. Notice that $\binom{n}{r} = \binom{n}{n - r}$.
$\binom{n}{r}$ is sometimes called the number of combinations of $r$ objects out of $n$.


[Figure 1.4.2: for $n = 3$, $r = 2$, row (a) lists the unordered selections 12, 13, 23; row (b) lists the orderings 12, 21, 13, 31, 23, 32 obtained from them.]

FIGURE 1.4.2 Determination of $\binom{n}{r}$.


Unordered Samples of Size r, with Replacement


We wish to find the number of unordered sets $\{a_{i_1}, \ldots, a_{i_r}\}$, where the
$a_{i_k}$ belong to $\{a_1, \ldots, a_n\}$ and repetition is allowed. As an example, let
$n = 3$ and $r = 3$. Let the symbols be 1, 2, and 3. List all arrangements in a
column so that $a$ precedes $b$ if and only if $a$, read as an ordinary 3-digit
number, is less than $b$. In an adjacent column list a new set of sequences formed
from the old by adding 0 to the first digit, 1 to the second digit, and 2 to the
third digit.

111    123 = (1 + 0, 1 + 1, 1 + 2)
112    124 = (1 + 0, 1 + 1, 2 + 2)
113    125
122    134
123    135
133    145
222    234
223    235
233    245
333    345


In the first column we have unordered samples of size 3 (out of 3), with
replacement. In the second column we have unordered samples of size 3
(out of 5), without replacement. In this way we can set up a one-to-one
correspondence between unordered samples of size $r$ (out of $n$) with
replacement, and unordered samples of size $r$ (out of $n + r - 1$) without
replacement. Thus the number of such samples is

$$\binom{n + r - 1}{r} \tag{1.4.4}$$
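
The correspondence behind (1.4.4) can be reproduced mechanically. The sketch below is an illustrative Python check, not part of the original text; the helper name `shift` is an assumption. It applies the "add 0, 1, 2, ..." rule to every unordered sample with replacement and confirms that the images are exactly the unordered samples without replacement from n + r - 1 symbols.

```python
from itertools import combinations, combinations_with_replacement
from math import comb


def shift(sample):
    """Map a sorted sample with repetition to a strictly increasing one
    by adding 0 to the first entry, 1 to the second, 2 to the third, ..."""
    return tuple(x + k for k, x in enumerate(sample))


n, r = 3, 3
with_repl = list(combinations_with_replacement(range(1, n + 1), r))
shifted = [shift(s) for s in with_repl]
without_repl = list(combinations(range(1, n + r), r))

# The shifted samples are exactly the r-subsets of {1, ..., n + r - 1}.
bijection_ok = sorted(shifted) == sorted(without_repl)
print(len(with_repl), comb(n + r - 1, r), bijection_ok)
```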


An alternative way of looking at unordered samples with replacement is
to count all sequences $(a_{i_1}, \ldots, a_{i_r})$, each $a_{i_k} \in \{a_1, \ldots, a_n\}$,
subject to the constraint that sequences having the same occupancy numbers
$r_k$ = the number of occurrences of $a_k$, $k = 1, 2, \ldots, n$, are identified. The
$r_k$ are nonnegative integers satisfying $r_1 + r_2 + \cdots + r_n = r$; hence we must
count the number of nonnegative integer solutions $(r_1, \ldots, r_n)$ of the equation
$r_1 + \cdots + r_n = r$. This may be done combinatorially as follows.

Consider an arrangement of $r$ stars and $n - 1$ bars, as shown in Figure
1.4.3 for $n = 3$, $r = 4$ (the thicker bars at the sides are fixed). Each
arrangement corresponds to a solution of $r_1 + \cdots + r_n = r$. The number of
arrangements is the number of ways of selecting $r$ positions out of $n + r - 1$
for the stars to occur (or $n - 1$ positions for the bars); that is,
$\binom{n + r - 1}{r}$. For $n = 3$, $r = 4$, there are 15 solutions.

[Figure 1.4.3: an arrangement of $r = 4$ stars and $n - 1 = 2$ bars between two fixed end bars.]

FIGURE 1.4.3 Counting Unordered Samples with Replacement.


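r_1   r_2   r_3   Sample

 0     0     4    a_3 a_3 a_3 a_3
 0     4     0    a_2 a_2 a_2 a_2
 4     0     0    a_1 a_1 a_1 a_1
 0     1     3    a_2 a_3 a_3 a_3
 0     2     2    a_2 a_2 a_3 a_3
 0     3     1    a_2 a_2 a_2 a_3
 1     0     3    a_1 a_3 a_3 a_3
 2     0     2    a_1 a_1 a_3 a_3
 3     0     1    a_1 a_1 a_1 a_3
 1     3     0    a_1 a_2 a_2 a_2
 2     2     0    a_1 a_1 a_2 a_2
 3     1     0    a_1 a_1 a_1 a_2
 2     1     1    a_1 a_1 a_2 a_3
 1     2     1    a_1 a_2 a_2 a_3
 1     1     2    a_1 a_2 a_3 a_3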
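
The stars-and-bars argument translates directly into an enumeration: choose which n - 1 of the n + r - 1 positions hold bars and read off the runs of stars between them. The following Python sketch is illustrative only (the function name is hypothetical and not from the text); it lists the solutions for n = 3, r = 4.

```python
from itertools import combinations
from math import comb


def occupancy_solutions(n, r):
    """Yield all (r_1, ..., r_n) of nonnegative integers summing to r,
    by choosing which n - 1 of the n + r - 1 positions hold bars."""
    for bars in combinations(range(n + r - 1), n - 1):
        # Add the two fixed end bars just outside the positions.
        edges = (-1,) + bars + (n + r - 1,)
        # The stars between consecutive bars give one occupancy number.
        yield tuple(edges[i + 1] - edges[i] - 1 for i in range(n))


solutions = list(occupancy_solutions(3, 4))
print(len(solutions), comb(3 + 4 - 1, 4))
```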


▶ Example 2. Find the probability of obtaining four of a kind in an
ordinary five-card poker hand.

There are $\binom{52}{5}$ distinct poker hands (without regard to order), and so we
may take $\Omega$ to have $\binom{52}{5}$ points. To obtain the number of hands in which
there are four of a kind:

(a) Choose the face value to appear four times (13 choices: A, K, Q, ..., 2).

(b) Choose the fifth card (48 ways).

Thus $p = (13)(48)/\binom{52}{5}$. Figure 1.4.4 indicates the selection process.


Note. The problem may also be done using ordered samples. The number of
ordered poker hands is $(52)(51)(50)(49)(48) = (52)_5$ (the drawing is
without replacement). The number of ordered poker hands having
four of a kind is $(13)(48)\,5!$, so that $p = (13)(48)(5!)/(52)_5 =
(13)(48)/\binom{52}{5}$ as before. Here we may take the space $\Omega'$ to have $(52)_5$
points; each point of $\Omega$ corresponds to $5!$ points of $\Omega'$. ◀
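
As a numerical check on Example 2, the unordered and ordered counts can be evaluated exactly. This is a minimal Python sketch, not part of the original text; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial, perm

# Unordered count: 13 face values times 48 choices for the fifth card.
p_unordered = Fraction(13 * 48, comb(52, 5))

# Ordered count: each favorable hand can be arranged in 5! orders,
# and there are (52)_5 = 52 * 51 * 50 * 49 * 48 ordered hands.
p_ordered = Fraction(13 * 48 * factorial(5), perm(52, 5))

print(p_unordered, float(p_unordered))
```

Both routes give the same fraction, which is the point of the Note: the factor 5! appears in both numerator and denominator and cancels.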


[Figure 1.4.4: tree diagram; step (a) branches over the face values A, K, ..., 2, and step (b) branches over the 48 possible fifth cards.]

FIGURE 1.4.4 Counting Process for Selecting Four of a Kind. There is a one-to-one
correspondence between paths in the diagram and favorable outcomes.


▶ Example 3. Three balls are dropped into three boxes. Find the probability
that exactly one box will be empty.


In problems of dropping $r$ balls into $n$ boxes, we may regard the boxes
as (distinct) symbols $a_1, \ldots, a_n$; each toss of a ball corresponds to the
selection of a box. Thus the sequence $a_2 a_1 a_1 a_3$ corresponds to the first ball
into box 2, the second and third into box 1, and the fourth into box 3.

In general, an arrangement of $r$ balls in $n$ boxes corresponds to a sample of
size $r$ from the symbols $a_1, \ldots, a_n$. If we require that the sampling be with
replacement, this means that a given box can contain any number of balls.
Sampling without replacement means that a given box cannot contain more
than one ball. If we consider ordered samples we are saying that the balls are
distinguishable. For example, $a_3 a_7$ (ball 1 into box 3, ball 2 into box 7) is
different from $a_7 a_3$ (ball 1 into box 7, ball 2 into box 3); in other words, we
may regard the balls as being numbered $1, 2, \ldots, r$. Unordered sampling
corresponds to indistinguishable balls.

If there is no restriction on the number of balls in a given box, the total
number of arrangements, taking into account the order in which the balls are
tossed (i.e., regarding the balls as distinct), is the number of ordered samples
of size $r$ (from $\{a_1, \ldots, a_n\}$) with replacement, or $n^r$. If the boxes are
energy levels in physics and the balls are particles, the Maxwell-Boltzmann
assumption is that all $n^r$ arrangements are equally likely.

If there can be at most one ball in a given box, the number of (ordered)
arrangements is $(n)_r$. If the order in which the balls are tossed is neglected,
we are simply choosing $r$ boxes out of $n$ to be occupied; the Fermi-Dirac
assumption takes the $\binom{n}{r}$ possible selections of boxes (or energy levels) as
equally likely.

We might also mention the Bose-Einstein assumption; here a box may
contain an unlimited number of balls, but the balls are indistinguishable;
that is, the order in which the balls are tossed is neglected, so that, for
example, $a_1 a_1 a_2 a_3$ is identified with $a_1 a_3 a_1 a_2$. Thus the number of
arrangements counted in this scheme is the number of unordered samples of size
$r$ with replacement, or $\binom{n + r - 1}{r}$. The Bose-Einstein assumption takes all
these arrangements as equally likely.

To return to the original problem, we have boxes $a_1$, $a_2$, and $a_3$ and
sequences of length 3 (three balls are tossed). We take all $3^3 = 27$ ordered
samples with replacement as equally likely. (We shall see that this model,
ordered sampling with replacement, corresponds to the tossing of the balls
independently; this idea will be developed in the next section.) Now

$$\begin{aligned}
P\{\text{exactly 1 box empty}\} ={}& P\{\text{box 1 empty, boxes 2 and 3 occupied}\} \\
&+ P\{\text{box 2 empty, boxes 1 and 3 occupied}\} \\
&+ P\{\text{box 3 empty, boxes 1 and 2 occupied}\}
\end{aligned}$$

Furthermore

$$P\{\text{box 1 empty, boxes 2 and 3 occupied}\} = P\{a_1 \text{ does not occur in the sequence } a_{i_1} a_{i_2} a_{i_3}, \text{ but } a_2 \text{ and } a_3 \text{ both occur}\}$$

If $a_1$ does not occur, either $a_2$ or $a_3$ must occur twice, and the other symbol
once. We may choose the symbol that is to occur twice in two ways; the
symbol that occurs once is then determined. If, say, $a_3$ occurs twice and $a_2$
once, the position of $a_2$ may be any of three possibilities; the position of the
two $a_3$'s is then determined. Thus the probability that box 1 will be empty
and boxes 2 and 3 occupied is $2(3)/27 = 6/27$ (in fact the six favorable
outcomes are $a_2 a_2 a_3$, $a_2 a_3 a_2$, $a_3 a_2 a_2$, $a_3 a_3 a_2$, $a_3 a_2 a_3$, and
$a_2 a_3 a_3$).

Thus the probability that exactly one box will be empty is, by symmetry,
$3(6)/27 = 2/3$. ◀
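
Since the sample space of Example 3 has only 27 points, the answer can be confirmed by listing them. The following Python sketch is an illustration under the same equally-likely assumption; it is not part of the original text.

```python
from fractions import Fraction
from itertools import product

boxes = (1, 2, 3)
# Each outcome lists the box chosen by ball 1, ball 2, ball 3.
outcomes = list(product(boxes, repeat=3))

# Exactly one box empty means exactly two distinct boxes are occupied.
favorable = [w for w in outcomes if len(set(w)) == 2]

p_one_empty = Fraction(len(favorable), len(outcomes))
print(len(outcomes), len(favorable), p_one_empty)
```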


▶ Example 4. In a 13-card bridge hand the probability that the hand will
contain the A K Q J 10 of spades is $\binom{47}{8}/\binom{52}{13}$. (The A K Q J 10 of spades
must be chosen, and afterward eight cards must be selected out of 47 that remain
after the five top spades have been removed.)

Now let us find the probability of obtaining the A K Q J 10 of at least one
suit. Thus, if $A_S$ is the event that the A K Q J 10 of spades is obtained, and
similarly for $A_H$, $A_D$, and $A_C$ (hearts, diamonds, and clubs), we are looking for

$$P(A_S \cup A_H \cup A_D \cup A_C)$$


The sets are not disjoint, so that we cannot simply add probabilities. It is
possible to obtain, for example, the A K Q J 10 of both spades and hearts in
a 13-card hand, and this probability is easy to compute:

$$P(A_S \cap A_H) = \frac{\binom{42}{3}}{\binom{52}{13}}$$

What we need here is a way of expressing $P(A_S \cup A_H \cup A_D \cup A_C)$ in
terms of the individual terms $P(A_S)$ etc., and the intersections $P(A_S \cap A_H)$
etc. We know that

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

If we have three events, then

$$\begin{aligned}
P(A \cup B \cup C) &= P(A \cup (B \cup C)) = P(A) + P(B \cup C) - P(A \cap (B \cup C)) \\
&= P(A) + P(B \cup C) - P((A \cap B) \cup (A \cap C)) \\
&= P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)
\end{aligned}$$

The general pattern is now clear and may be verified by induction.

$$P(A_1 \cup \cdots \cup A_n) = \sum_{i} P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \sum_{i < j < k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n-1} P(A_1 \cap \cdots \cap A_n) \tag{1.4.5}$$
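
When all points are equally likely, (1.4.5) reduces to a statement about the sizes of finite sets, which makes it easy to test on a small case. The Python sketch below is illustrative only; the function name and the four sample sets are assumptions, not taken from the text.

```python
from itertools import combinations


def union_size_by_inclusion_exclusion(sets):
    """Size of the union of finite sets via the alternating sum in (1.4.5)."""
    total = 0
    for m in range(1, len(sets) + 1):
        sign = (-1) ** (m - 1)
        # Sum the sizes of all m-fold intersections with the proper sign.
        for group in combinations(sets, m):
            total += sign * len(set.intersection(*group))
    return total


events = [{1, 2, 3, 4}, {3, 4, 5}, {1, 5, 6}, {2, 6, 7, 8}]
print(union_size_by_inclusion_exclusion(events), len(set.union(*events)))
```

Dividing both printed numbers by the number of points in the sample space turns the count into the probability identity.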


In the present problem the intersections taken three or four at a time are
empty; hence, by symmetry,

$$p = \frac{4\binom{47}{8} - 6\binom{42}{3}}{\binom{52}{13}}$$

It is illuminating to consider an incorrect approach to this problem.
Suppose that we first pick a suit (four choices); we then select the A K Q J 10
of that suit. The remaining eight cards can be anything (if they include the
A K Q J 10 of another suit, the condition that at least one A K Q J 10 of the
same suit be obtained will still be satisfied). Thus we have $\binom{47}{8}$ choices, so
that the desired probability is $4\binom{47}{8}/\binom{52}{13}$.


The above procedure illustrates multiple counting, the nemesis of the
combinatorial analyst (see Figure 1.4.5).

[Figure 1.4.5: tree diagram. Step (1) chooses a suit (Spades, Hearts, Diamonds, Clubs); step (2) chooses the remaining eight cards. Two paths are marked: Spades followed by A K Q J 10 of clubs, 2 of diamonds, 2 of spades, 2 of clubs; and Clubs followed by A K Q J 10 of spades, 2 of diamonds, 2 of spades, 2 of clubs.]

FIGURE 1.4.5 Multiple Counting.


In writing $p = 4\binom{47}{8}/\binom{52}{13}$ we are saying that there is a one-to-one
correspondence between paths in Figure 1.4.5 and favorable outcomes. But this
is not the case, since the two paths indicated in the diagram both lead to the
same result, namely, A K Q J 10 of spades, A K Q J 10 of clubs, 2 of diamonds,
2 of spades, 2 of clubs. In fact there are $6\binom{42}{3}$ such duplications. For
we can pick the two suits in $\binom{4}{2} = 6$ ways; then, after taking A K Q J 10 of
each suit, we select the remaining three cards in $\binom{42}{3}$ ways. If we subtract the
number of duplications, $6\binom{42}{3}$, from the original count, $4\binom{47}{8}$, we obtain the
correct result.

To rephrase: the counting process we have proposed counts the number of
paths in the above diagram, that is, the number of choices at Step 1 times the
number of choices at Step 2. However, the paths do not in general lead to
distinct "results," namely, distinct bridge hands. ◀
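
The two counts discussed in Example 4 can be evaluated exactly to see how much the naive method overcounts. This is a minimal Python sketch for illustration; it is not part of the original text and the variable names are assumptions.

```python
from fractions import Fraction
from math import comb

hands = comb(52, 13)

# Inclusion-exclusion: single suits minus pairs of suits; three or more
# suits would need at least 15 cards, so those terms vanish.
correct_count = 4 * comb(47, 8) - comb(4, 2) * comb(42, 3)

# The naive count (pick a suit, then any 8 of the other 47 cards)
# counts every two-suit hand twice.
naive_count = 4 * comb(47, 8)

p_correct = Fraction(correct_count, hands)
print(float(p_correct), naive_count - correct_count)
```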


PROBLEMS 


1. If a 3-digit number (000 to 999) is chosen at random, find the probability that
   exactly 1 digit will be >5.

2. Find the probability that a five-card poker hand will be:
   (a) A straight (five cards in sequence regardless of suit; ace may be high but
       not low).

   (b) Three of a kind (three cards of the same face value x, plus two cards with
       face values y and z, with x, y, z distinct).

   (c) Two pairs (two cards of face value x, two of face value y, and one of face
       value z, with x, y, z distinct).


3. An urn contains 3 red, 8 yellow, and 13 green balls; another urn contains 5
   red, 7 yellow, and 6 green balls. One ball is selected from each urn. Find the
   probability that both balls will be of the same color.

4. An experiment consists of drawing 10 cards from an ordinary 52-card pack.
   (a) If the drawing is done with replacement, find the probability that no two
       cards will have the same face value.

   (b) If the drawing is done without replacement, find the probability that at
       least 9 cards will be of the same suit.

5. An urn contains 10 balls numbered from 1 to 10. Five balls are drawn without
   replacement. Find the probability that the second largest of the five numbers
   drawn will be 8.

6. m men and w women seat themselves at random in m + w seats arranged in a
   row. Find the probability that all the women will be adjacent.

7. If a box contains 75 good light bulbs and 25 defective bulbs and 15 bulbs are
   removed, find the probability that at least one will be defective.

8. Eight cards are drawn without replacement from an ordinary deck. Find the
   probability of obtaining exactly three aces or exactly three kings (or both).


9. (The game of rencontre). An urn contains n tickets numbered 1, 2, ..., n.
   The tickets are shuffled thoroughly and then drawn one by one without
   replacement. If the ticket numbered r appears in the rth drawing, this is
   denoted as a match (French: rencontre). Show that the probability of at least
   one match is

   $$1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + \frac{(-1)^{n-1}}{n!} \to 1 - e^{-1} \quad \text{as } n \to \infty$$


10. A "language" consists of three "words," $W_1 = a$, $W_2 = ba$, $W_3 = bb$. Let
    $N(k)$ be the number of "sentences" using exactly $k$ letters (e.g., $N(1) = 1$
    (i.e., a), $N(2) = 3$ (aa, ba, bb), $N(3) = 5$ (aaa, aba, abb, baa, bba); no
    space is allowed between words).

    (a) Show that $N(k) = N(k - 1) + 2N(k - 2)$, $k = 2, 3, \ldots$ (define $N(0) = 1$).

    (b) Show that the general solution to the second-order homogeneous linear
        difference equation (a) [with $N(0)$ and $N(1)$ specified], is
        $N(k) = A\,2^k + B(-1)^k$, where $A$ and $B$ are determined by $N(0)$ and
        $N(1)$. Evaluate $A$ and $B$ in the present case.


11. (The birthday problem) Assume that a person's birthday is equally likely to
    fall on any of the 365 days in a year (neglect leap years). If r people are
    selected, find the probability that all r birthdays will be different.
    Equivalently, if r balls are dropped into 365 boxes, we are looking for the
    probability that no box will contain more than one ball. It turns out that the
    probability is less than 1/2 for $r \geq 23$, so that in a class of 23 or more
    students the odds are that two or more people will have the same birthday.


12. Fourteen balls are dropped into six boxes. Find the number of arrangements
    (ordered samples of size 14 with replacement, from six symbols) whose
    occupancy numbers coincide with 4, 4, 2, 2, 2, 0 in some order (i.e., boxes
    $i_1$ and $i_2$ contain four balls, boxes $i_3$, $i_4$, and $i_5$ two balls, and
    box $i_6$ no balls, for some $i_1, \ldots, i_6$).


13. (a) Let $\Omega$ be a set with $n$ elements. Show that there are $2^n$ subsets of
        $\Omega$. For example, if $\Omega = \{1, 2, 3\}$, the subsets are $\emptyset$,
        {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, and {1, 2, 3} = $\Omega$; $2^3 = 8$
        altogether.

    (b) How many ways are there of selecting ordered pairs $(A, B)$ of subsets of
        $\Omega$ such that $A \subset B$? For example, A = {1}, B = {1, 3} gives such
        a pair, but A = {1, 2}, B = {1, 3} does not.

14. Let $\Omega$ be a finite set. A partition of $\Omega$ is an (unordered) set
    $\{A_1, \ldots, A_k\}$, where the $A_i$ are disjoint nonempty subsets whose union
    is $\Omega$. For example, if $\Omega = \{1, 2, 3\}$, there are five partitions.

    $A_1 = \{1, 2, 3\}$
    $A_1 = \{1, 2\}$, $A_2 = \{3\}$
    $A_1 = \{1, 3\}$, $A_2 = \{2\}$
    $A_1 = \{1\}$, $A_2 = \{2, 3\}$
    $A_1 = \{1\}$, $A_2 = \{2\}$, $A_3 = \{3\}$

    Let $g(n)$ be the number of partitions of a set with $n$ elements.

    (a) Show that $g(n) = \sum_{k=0}^{n-1} \binom{n-1}{k} g(k)$ [define $g(0) = 1$].

    (b) Show that $g(n) = e^{-1} \sum_{k=0}^{\infty} k^n / k!$

    HINT: Show that the series satisfies the difference equation of part (a).

1.5 INDEPENDENCE 


Consider the following experiment. A person is selected at random and his
height is recorded. After this the last digit of the license number of the next
car to pass is noted. If $A$ is the event that the height is over 6 feet, and $B$ is
the event that the digit is $\geq 7$, then, intuitively, $A$ and $B$ are "independent"
in the sense that knowledge about the occurrence or nonoccurrence of one
of the events should not influence the odds about the other. For example,
say that $P(A) = .2$, $P(B) = .3$. In a long sequence of trials we would expect
the following situation.


(Roughly) 20% of the time $A$ occurs; of those cases in which $A$ occurs:
    30% $B$ occurs
    70% $B$ does not occur

80% of the time $A$ does not occur; of these cases:
    30% $B$ occurs
    70% $B$ does not occur

Thus, if $B$ is independent of $A$, it appears that $P(A \cap B)$ should be
$.2(.3) = .06 = P(A)P(B)$, and $P(A^c \cap B)$ should be $.8(.3) = .24 = P(A^c)P(B)$.

Conversely, if $P(A \cap B) = P(A)P(B) = .06$ and $P(A^c \cap B) = P(A^c)P(B) = .24$,
then, if $A$ occurs roughly 20% of the time and we look at only the
cases in which $A$ occurs, $B$ must occur in roughly 30% of these cases in order
to have $A \cap B$ occur 6% of the time. Similarly, if we look at the cases in
which $A$ does not occur (80%), then, since we are assuming that $A^c \cap B$
occurs 24% of the time, we must have $B$ occurring in 30% of these cases.
Thus the odds about $B$ are not changed by specifying the occurrence or
nonoccurrence of $A$.

It appears that we should say that event $B$ is independent of $A$ iff
$P(A \cap B) = P(A)P(B)$ and $P(A^c \cap B) = P(A^c)P(B)$. However, the second
condition is already implied by the first. If $P(A \cap B) = P(A)P(B)$,

$$P(A^c \cap B) = P(B - A) = P(B - (A \cap B)) = P(B) - P(A \cap B)$$

since $A \cap B$ is a subset of $B$; hence

$$P(A^c \cap B) = P(B) - P(A)P(B) = (1 - P(A))P(B) = P(A^c)P(B)$$

Thus $B$ is independent of $A$; that is, knowledge of $A$ does not influence the
odds about $B$, iff $P(A)P(B) = P(A \cap B)$. But this condition is perfectly
symmetrical, in other words, $B$ is independent of $A$ iff $A$ is independent of $B$.
Thus we are led to the following definition.


DEFINITION. Two events $A$ and $B$ are independent iff $P(A \cap B) = P(A)P(B)$.


If we have three events $A$, $B$, $C$ that are (intuitively) independent,
knowledge of the occurrence or nonoccurrence of $A \cap B$, for example, should not
change the odds about $C$; this leads as above to the requirement that
$P(A \cap B \cap C) = P(A \cap B)P(C)$. But if $A$, $B$, and $C$ are to be independent,
we must expect that $A$ and $B$ are independent (as well as $A$ and $C$, and $B$
and $C$), so we should have all of the following conditions satisfied.

$$P(A \cap B) = P(A)P(B), \qquad P(A \cap C) = P(A)P(C), \qquad P(B \cap C) = P(B)P(C)$$

and

$$P(A \cap B \cap C) = P(A)P(B)P(C)$$


We are led to the following definition.

DEFINITION. Let $A_i$, $i \in I$, where $I$ is an arbitrary index set, possibly
infinite, be an arbitrary collection of events [a fixed probability space
$(\Omega, \mathscr{F}, P)$ is of course assumed].
The $A_i$ are said to be independent iff for each finite set of distinct indices
$i_1, \ldots, i_k \in I$ we have

$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1})P(A_{i_2}) \cdots P(A_{i_k})$$


REMARKS
1. If the $A_i$, $i \in I$, are independent, it follows that

   $$P(B_{i_1} \cap \cdots \cap B_{i_k}) = P(B_{i_1}) \cdots P(B_{i_k})$$

   for all (distinct) $i_1, \ldots, i_k$, where each $B_{i_r}$ may be either $A_{i_r}$
   or $A_{i_r}^c$. To put it simply, if the $A_i$ are independent and we replace any
   event by its complement, we still have independence [see Problem 1;
   actually we have already done most of the work by showing that
   $P(A \cap B) = P(A)P(B)$ implies $P(A^c \cap B) = P(A^c)P(B)$].
2. The condition $P(A_1 \cap \cdots \cap A_n) = P(A_1) \cdots P(A_n)$ does not
   imply the analogous condition for any smaller family of events. For
   example, it is possible to have $P(A \cap B \cap C) = P(A)P(B)P(C)$, but
   $P(A \cap B) \neq P(A)P(B)$, $P(A \cap C) \neq P(A)P(C)$,
   $P(B \cap C) \neq P(B)P(C)$. In particular, $A$, $B$, and $C$ are not independent.
   Conversely it is possible to have, for example, $P(A \cap B) = P(A)P(B)$,
   $P(A \cap C) = P(A)P(C)$, $P(B \cap C) = P(B)P(C)$, but
   $P(A \cap B \cap C) \neq P(A)P(B)P(C)$. Thus $A$ and $B$ are independent, as
   are $A$ and $C$, and also $B$ and $C$, but $A$, $B$, and $C$ are not independent.


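▶ Example 1. Let two dice be tossed, and take $\Omega$ = all ordered pairs
$(i, j)$, $i, j = 1, 2, \ldots, 6$, with each point assigned probability 1/36.
Let

    A = {first die = 1, 2, or 3}
    B = {first die = 3, 4, or 5}
    C = {the sum of the two faces is 9}

(Thus $A \cap B = \{(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)\}$,
$A \cap C = \{(3, 6)\}$, $B \cap C = \{(3, 6), (4, 5), (5, 4)\}$,
$A \cap B \cap C = \{(3, 6)\}$.)

Then

$$P(A \cap B) = \tfrac{1}{6} \neq P(A)P(B) = \tfrac{1}{2}\left(\tfrac{1}{2}\right) = \tfrac{1}{4}$$
$$P(A \cap C) = \tfrac{1}{36} \neq P(A)P(C) = \tfrac{1}{2}\left(\tfrac{4}{36}\right) = \tfrac{1}{18}$$
$$P(B \cap C) = \tfrac{1}{12} \neq P(B)P(C) = \tfrac{1}{2}\left(\tfrac{4}{36}\right) = \tfrac{1}{18}$$

But

$$P(A \cap B \cap C) = \tfrac{1}{36} = P(A)P(B)P(C)$$

Now in the same probability space let

    A = {first die = 1, 2, or 3}
    B = {second die = 4, 5, or 6}
    C = {the sum of the two faces is 7}

(Thus $A \cap C = \{(1, 6), (2, 5), (3, 4)\} = A \cap B \cap C$, etc.) Then

$$P(A \cap B) = \tfrac{1}{4} = P(A)P(B) = \tfrac{1}{2}\left(\tfrac{1}{2}\right)$$
$$P(A \cap C) = \tfrac{1}{12} = P(A)P(C) = \tfrac{1}{2}\left(\tfrac{1}{6}\right)$$
$$P(B \cap C) = \tfrac{1}{12} = P(B)P(C) = \tfrac{1}{2}\left(\tfrac{1}{6}\right)$$

But

$$P(A \cap B \cap C) = \tfrac{1}{12} \neq P(A)P(B)P(C) = \tfrac{1}{24}$$ ◀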
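
Both halves of Example 1 can be confirmed by enumerating the 36 points of the sample space. The Python sketch below is illustrative and not part of the original text; the helper names `prob`, `both`, and `check` are assumptions. Each call reports whether all three pairwise product rules hold and whether the triple product rule holds.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a predicate on a point (i, j)."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def both(e, f):
    """Predicate for the intersection of two events."""
    return lambda w: e(w) and f(w)


def check(a, b, c):
    """Return (all pairwise product rules hold, triple product rule holds)."""
    pairwise = all(
        prob(both(e, f)) == prob(e) * prob(f)
        for e, f in ((a, b), (a, c), (b, c))
    )
    triple = prob(both(both(a, b), c)) == prob(a) * prob(b) * prob(c)
    return pairwise, triple


first = check(lambda w: w[0] <= 3, lambda w: 3 <= w[0] <= 5, lambda w: sum(w) == 9)
second = check(lambda w: w[0] <= 3, lambda w: w[1] >= 4, lambda w: sum(w) == 7)
print(first, second)
```

In the first configuration the triple rule holds without the pairwise rules; in the second the pairwise rules hold without the triple rule, which is why the definition of independence requires every finite subfamily.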


We illustrate the idea of independence by considering some problems 
related to the classical coin-tossing experiment. 

A sequence of Bernoulli trials is a sequence of $n$ independent observations,
each of which may result in exactly one of two possible situations, called
"success" or "failure." At each observation the probability of success is $p$,
and the probability of failure is $q = 1 - p$.


SPECIAL CASES
(a) Toss a coin independently $n$ times, with success = heads, failure
    = tails.

(b) Examine components produced on an assembly line; success =
    acceptable, failure = defective.

(c) Transmit binary digits through a communication channel;
    success = digit received correctly, failure = digit received
    incorrectly.


We take $\Omega$ = all $2^n$ ordered sequences of length $n$, with components 0
(failure) and 1 (success). To assign probabilities in accordance with the
physical description given above, we reason as follows.

Consider the sample point $\omega = 11 \cdots 10 \cdots 0$ ($k$ 1's followed by $n - k$
0's). Let $A_i$ = {success on trial $i$} = the set of all sequences with a 1 in the
$i$th coordinate. Because of the independence of the trials we must assign

$$\begin{aligned}
P\{\omega\} &= P(A_1 \cap A_2 \cap \cdots \cap A_k \cap A_{k+1}^c \cap \cdots \cap A_n^c) \\
&= P(A_1)P(A_2) \cdots P(A_k)P(A_{k+1}^c) \cdots P(A_n^c) = p^k q^{n-k}
\end{aligned}$$

Similarly, any point with $k$ 1's and $n - k$ 0's is assigned probability $p^k q^{n-k}$.
The number of such points is the number of ways of selecting $k$ distinct
positions for the 1's to occur (or selecting $n - k$ distinct positions for the
0's); that is, $\binom{n}{k}$. The sum of the probabilities assigned to all the points is

$$\sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} = (p + q)^n = 1$$

by the binomial theorem. Thus we have a legitimate assignment. Furthermore,
the probability of obtaining exactly $k$ successes is

$$p(k) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \ldots, n \tag{1.5.1}$$

$p(k)$, $k = 0, 1, \ldots, n$, is called the binomial probability function.


▶ Example 2. Six balls are tossed independently into three boxes A, B, C.
For each ball the probability of going into a specific box is 1/3. Find the
probability that box A will contain (a) exactly four balls, (b) at least two
balls, (c) at least five balls.

Here we have six Bernoulli trials, with success corresponding to a ball in
box A, failure to a ball in box B or C. Thus $n = 6$, $p = 1/3$, $q = 2/3$, and
so the required probabilities are

(a) $p(4) = \binom{6}{4}\left(\tfrac{1}{3}\right)^4\left(\tfrac{2}{3}\right)^2$
(b) $1 - p(0) - p(1) = 1 - \left(\tfrac{2}{3}\right)^6 - \binom{6}{1}\left(\tfrac{1}{3}\right)\left(\tfrac{2}{3}\right)^5$
(c) $p(5) + p(6) = \binom{6}{5}\left(\tfrac{1}{3}\right)^5\left(\tfrac{2}{3}\right) + \left(\tfrac{1}{3}\right)^6$ ◀
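
The binomial probability function (1.5.1) is short enough to code directly, and exact rational arithmetic gives the three answers of Example 2 in closed form. The following Python sketch is an illustration, not part of the original text; the function name `binomial_pmf` is an assumption.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli trials (1.5.1)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 6, Fraction(1, 3)
exactly_four = binomial_pmf(4, n, p)
at_least_two = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)
at_least_five = binomial_pmf(5, n, p) + binomial_pmf(6, n, p)

# The probabilities over k = 0, ..., n should add up to (p + q)^n = 1.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
print(exactly_four, at_least_two, at_least_five, total)
```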


We now consider generalized Bernoulli trials. Here we have a sequence
of independent trials, and on each trial the result is exactly one of the $k$
possibilities $b_1, \ldots, b_k$. On a given trial let $b_j$ occur with probability $p_j$,
$j = 1, 2, \ldots, k$ ($p_j \geq 0$, $\sum_{j=1}^{k} p_j = 1$).

We take $\Omega$ = all $k^n$ ordered sequences of length $n$ with components
$b_1, \ldots, b_k$; for example, if $\omega = (b_1 b_3 b_2 b_2 \cdots)$ then $b_1$ occurs on
trial 1, $b_3$ on trial 2, $b_2$ on trials 3 and 4, and so on. As in the previous
situation, assign to the point

$$\omega = (\underbrace{b_1 b_1 \cdots b_1}_{n_1}\ \underbrace{b_2 \cdots b_2}_{n_2}\ \cdots\ \underbrace{b_k \cdots b_k}_{n_k})$$

the probability $p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$. This is the probability assigned to any
sequence having $n_i$ occurrences of $b_i$, $i = 1, 2, \ldots, k$. To find the number of
such sequences, first select $n_1$ positions out of $n$ for the $b_1$'s to occur, then
$n_2$ positions out of the remaining $n - n_1$ for the $b_2$'s, $n_3$ out of $n - n_1 - n_2$
for the $b_3$'s, and so on. Thus the number of sequences having exactly $n_1$
occurrences of $b_1$, ..., $n_k$ occurrences of $b_k$ is

$$\binom{n}{n_1}\binom{n - n_1}{n_2}\binom{n - n_1 - n_2}{n_3} \cdots \binom{n - n_1 - \cdots - n_{k-1}}{n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}$$

The total probability assigned to all points is

$$\sum_{\substack{n_1, \ldots, n_k \text{ nonneg. integers,} \\ \sum_{i=1}^{k} n_i = n}} \frac{n!}{n_1!\, n_2! \cdots n_k!}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} = (p_1 + \cdots + p_k)^n = 1 \tag{1.5.2}$$

To see this, notice that $(p_1 + \cdots + p_k)^n = (p_1 + \cdots + p_k)(p_1 + \cdots + p_k)
\cdots (p_1 + \cdots + p_k)$, $n$ times. A typical term in the expansion is
$p_1^{n_1} \cdots p_k^{n_k}$; the number of times this term appears is

$$\frac{n!}{n_1!\, n_2! \cdots n_k!}$$

since we may count the appearances by selecting $p_1$ from $n_1$ of the $n$ factors,
selecting $p_2$ from $n_2$ of the remaining $n - n_1$ factors, and so forth. Thus we
have a legitimate assignment of probabilities.
The probability that $b_1$ will occur $n_1$ times, $b_2$ will occur $n_2$ times, ...,
and $b_k$ will occur $n_k$ times is

$$p(n_1, \ldots, n_k) = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k} \tag{1.5.3}$$

$p(n_1, \ldots, n_k)$, $n_1, \ldots, n_k$ = nonnegative integers whose sum is $n$, is called
the multinomial probability function.

Note that when $k = 2$, generalized Bernoulli trials reduce to ordinary
Bernoulli trials [let $b_1$ = "success," $b_2$ = "failure", $p_1 = p$, $p_2 = q = 1 - p$,
$n_1 = k$, $n_2 = n - k$; then $(n!/n_1!\,n_2!)\,p_1^{n_1} p_2^{n_2} = \binom{n}{k} p^k q^{n-k}$ =
probability of $k$ successes in $n$ trials].


▶ Example 3. Throw four unbiased dice independently. Find the
probability of exactly two 1's and one 2.
Let

    b_1 = "1 occurs" (on a given trial)    p_1 = 1/6    n_1 = 2
    b_2 = "2 occurs"                       p_2 = 1/6    n_2 = 1
    b_3 = "3, 4, 5, or 6 occurs"           p_3 = 2/3    n_3 = 1

The probability is $(4!/2!\,1!\,1!)(1/6)^2(1/6)^1(2/3)^1 = 1/27$. ◀


▶ Example 4. If 10 balls are tossed independently into five boxes, with a
given ball equally likely to fall into each box, find the probability that all
boxes will have the same number of balls.

Let $b_i$ = "ball goes into box $i$", $p_i = 1/5$, $n_i = 2$, $i = 1, 2, 3, 4, 5$, $n = 10$.
The probability = $(10!/2^5)(1/5)^{10}$. ◀
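
Formula (1.5.3) can be evaluated exactly for Examples 3 and 4. The Python sketch below is an illustration only, not part of the original text; the function name `multinomial_pmf` and its argument layout are assumptions.

```python
from fractions import Fraction
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of counts[i] occurrences of outcome i in sum(counts)
    independent trials, where outcome i has probability probs[i] (1.5.3)."""
    coefficient = factorial(sum(counts))
    for c in counts:
        # Each division is exact, so integer division loses nothing.
        coefficient //= factorial(c)
    return coefficient * prod(p**c for p, c in zip(probs, counts))


sixth = Fraction(1, 6)
example_3 = multinomial_pmf([2, 1, 1], [sixth, sixth, Fraction(2, 3)])
example_4 = multinomial_pmf([2] * 5, [Fraction(1, 5)] * 5)
print(example_3, example_4)
```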


REMARK. If $r$ balls are tossed independently into $n$ boxes, we have seen that
    the event {ball 1 into box $i_1$, ball 2 into box $i_2$, ..., ball $r$ into box $i_r$}
    must have probability $p_{i_1} p_{i_2} \cdots p_{i_r}$, where $p_i$ is the probability
    that a specific ball will fall into box $i$. In particular, if all $p_i = 1/n$ (as is
    assumed in Examples 2 and 4 above), the probability of the event is
    $1/n^r$. In other words, all ordered samples of size $r$ (out of $n$ symbols),
    with replacement, have the same probability. This justifies the
    assertion we made in Example 3 of Section 1.4.

    We emphasize that the independence of the tosses is an assumption,
    not a theorem. For example, if two balls are tossed into two boxes and a
    given box can contain at most one ball, then the events $A_1$ = {ball 1
    goes into box 1} and $A_2$ = {ball 2 goes into box 1} are not independent,
    since $P(A_1 \cap A_2) = 0$, $P(A_1) = P(A_2) = 1/2$.


▶ Example 5. An urn contains equal numbers of black, white, red, and
green balls. Four balls are drawn independently, with replacement. Find
the probability $p(k)$ that exactly $k$ colors will appear in the sample,
$k = 1, 2, 3, 4$.

This is a multinomial problem with $n = 4$ and $b_1$ = B = black, $b_2$ = W =
white, $b_3$ = R = red, $b_4$ = G = green.

k = 4: The probability that all four colors will appear is given by the
multinomial formula with all $n_i = 1$; that is,

$$p(4) = \frac{4!}{1!\,1!\,1!\,1!}\left(\frac{1}{4}\right)^4 = \frac{6}{64}$$

k = 3: The probability of obtaining two black, one white, and one red
ball is given by the multinomial formula with $n_1 = 2$, $n_2 = n_3 = 1$, $n_4 = 0$;
that is,

$$\frac{4!}{2!\,1!\,1!\,0!}\left(\frac{1}{4}\right)^4 = \frac{3}{64}$$

To find the total probability of obtaining exactly three colors, multiply by
the number of ways of selecting three colors out of four [$\binom{4}{3} = 4$] and the
number of ways of selecting one of three colors to be repeated (3). Thus

$$p(3) = 36/64.$$

k = 2: The probability of obtaining two black and two white balls is

$$\frac{4!}{2!\,2!\,0!\,0!}\left(\frac{1}{4}\right)^4 = \frac{3}{128}$$

Thus the probability of obtaining two balls of one color and two of another is
3/128 times the number of ways of selecting two colors out of 4 [$\binom{4}{2} = 6$],
or 9/64. Similarly, the probability of obtaining three of one color and one of
another is

$$\frac{4!}{3!\,1!\,0!\,0!}\left(\frac{1}{4}\right)^4 (12) = \frac{12}{64}$$

Notice that the extra factor is $(4)(3) = 12$, not $\binom{4}{2} = 6$, since three blacks
and one white constitute a different selection from three whites and one black.
Thus

$$p(2) = 9/64 + 12/64 = 21/64.$$

k = 1: The probability that all balls will be of the same color is

$$p(1) = \frac{4!}{4!\,0!\,0!\,0!}\left(\frac{1}{4}\right)^4 (4) = \frac{1}{64}$$

REMARK. The sample space of this problem consists of all ordered samples 
of size 4, with replacement, from the symbols B, W, R, G, with all 
samples assigned the same probability. The reader should resist the 
temptation to assign equal probability to all unordered samples of 
size 4, with replacement. This would imply, for example, that 
{WWWW} and {WWWB, WWBW, WBWW, BWWW} = {three 
    whites and one black} have the same probability, and this is inconsistent
    with the assumption of independence. ◀
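
The sample space described in the Remark has only 4^4 = 256 equally likely points, so the four values p(k) of Example 5 can be checked by direct enumeration. This Python sketch is illustrative and not part of the original text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

colors = "BWRG"
# All ordered samples of size 4 with replacement; each has probability 1/256.
samples = list(product(colors, repeat=4))

# Tally the samples by the number of distinct colors they contain.
tally = Counter(len(set(s)) for s in samples)
p = {k: Fraction(tally[k], len(samples)) for k in (1, 2, 3, 4)}
print(p)
```

Note that the enumeration weights ordered samples equally, as the Remark requires; weighting unordered samples equally would give different values.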


PROBLEMS 


1. Show that the events $A_i$, $i \in I$, are independent iff
   $P(B_{i_1} \cap \cdots \cap B_{i_k}) = P(B_{i_1}) \cdots P(B_{i_k})$ for all (distinct)
   $i_1, \ldots, i_k$, where each $B_{i_r}$ may be either $A_{i_r}$ or $A_{i_r}^c$.


2. Let $p(k)$, $k = 0, 1, \ldots, n$, be the binomial probability function.
   (a) If $(n + 1)p$ is not an integer, show that $p(k)$ is strictly increasing up to
       $k = [(n + 1)p]$ = the largest integer $\leq (n + 1)p$, and attains a maximum
       at $[(n + 1)p]$. $p(k)$ is strictly decreasing for all larger values of $k$.
   (b) If $(n + 1)p$ is an integer, show that $p(k)$ is strictly increasing up to
       $k = (n + 1)p - 1$ and has a double maximum at $k = (n + 1)p - 1$ and
       $k = (n + 1)p$; $p(k)$ is strictly decreasing for larger values of $k$.

3. A single card is drawn from an ordinary deck. Give examples of events A and B
   associated with this experiment that are
   (a) Mutually exclusive (disjoint) but not independent
   (b) Independent but not mutually exclusive
   (c) Independent and mutually exclusive
   (d) Neither independent nor mutually exclusive


4. Of the 100 people in a certain village, 50 always tell the truth, 30 always lie, and 
20 always refuse to answer. A sample of size 30 is taken with replacement. 
(a) Find the probability that the sample will contain 10 people of each category. 
(b) Find the probability that there will be exactly 12 liars. 


5. Six unbiased dice are tossed independently. Find the probability that the number 
of 1’s minus the number of 2’s will be 3. 


6. How many terms are there in the multinomial expansion (1.5.2)? 


7. An urn contains $t_1$ balls of color $C_1$, $t_2$ of color $C_2$, ..., $t_k$ of color $C_k$.
   (a) If $n$ balls are drawn without replacement, show that the probability of
       obtaining exactly $n_1$ of color $C_1$, $n_2$ of color $C_2$, ..., $n_k$ of color
       $C_k$ is

       $$\frac{\binom{t_1}{n_1}\binom{t_2}{n_2} \cdots \binom{t_k}{n_k}}{\binom{t}{n}}$$

       where $t = t_1 + t_2 + \cdots + t_k$ is the total number of balls in the urn and
       $\binom{t_i}{n_i}$ is defined to be 0 if $n_i > t_i$. (Notice the pattern:
       $t_1 + t_2 + \cdots + t_k = t$, $n_1 + n_2 + \cdots + n_k = n$.) The above
       expression, regarded as a function of $n_1, \ldots, n_k$, is called the
       hypergeometric probability function.
   (b) What is the probability of the event of part (a) if the balls are drawn
       independently, with replacement?
8. (a) If an event A is independent of itself, that is, if A and A are independent, 
show that P(A) = 0 or 1. 


(b) If P(A) = 0 or 1, show that A and B are independent for any event B, in 
particular, that A and A are independent. 


1.6 CONDITIONAL PROBABILITY 


If A and B are independent events, the occurrence or nonoccurrence of A 
does not influence the odds concerning the occurrence of B. If A and B are 
not independent, it would be desirable to have some way of measuring exactly 
how much the occurrence of one of the events changes the odds about the 
other. 

In a long sequence of independent repetitions of the experiment, P(A) 
measures the fraction of the trials on which A occurs. If we look only at the 
trials on which A occurs (say there are $n_A$ of these) and record those trials 
on which B occurs also (there are $n_{AB}$ of these, where $n_{AB}$ is the number 
of trials on which both A and B occur), the ratio $n_{AB}/n_A$ is a measure of 
P(B | A), the “conditional probability of B given A,” that is, the fraction of 
the time that B occurs, looking only at trials producing an occurrence of A. 
Comparing P(B | A) with P(B) will indicate the difference between the odds 
about B when A is known to have occurred, and the odds about B before any 
information about A is revealed. 

The above discussion suggests that we define the conditional probability 
of B given A as 

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} \qquad (1.6.1)$$

This makes sense if $P(A) > 0$. 


> Example 1. Throw two unbiased dice independently. Let A = {sum of 
the faces = 8}, B = {faces are equal}. Then 

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P\{4\text{-}4\}}{P\{4\text{-}4,\ 5\text{-}3,\ 3\text{-}5,\ 6\text{-}2,\ 2\text{-}6\}} = \frac{1/36}{5/36} = \frac{1}{5}$$

(see Figure 1.6.1). 

There is a point here that may be puzzling. In counting the outcomes 
favorable to A, we note that there are two ways of making an 8 using a 5 and 
a 3, but only one way using a 4 and a 4. The probability space consists of 
all 36 ordered pairs (i, j), i, j = 1, 2, 3, 4, 5, 6, each assigned probability 1/36. 
The ordered pair (4, 4) is the same as the ordered pair (4, 4) (this is rather 
difficult to dispute), while (5, 3) is different from (3, 5). Alternatively, think 
of the first die to be thrown as red and the second as green. A 5 on the red 
die and a 3 on the green is a different outcome from a 3 on the red and a 5 on 
the green. However, using 4’s we can make an 8 in only one way, a 4 on the 
red followed by a 4 on the green. < 
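
The calculation of Example 1 can be checked by listing the 36 ordered pairs explicitly. The following is a minimal sketch; the helper names `prob`, `in_A`, and `in_B` are illustrative and do not come from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (i, j), each assigned probability 1/36.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def in_A(w):
    return w[0] + w[1] == 8   # sum of the faces is 8

def in_B(w):
    return w[0] == w[1]       # faces are equal

p_A = prob(in_A)
p_AB = prob(lambda w: in_A(w) and in_B(w))
p_B_given_A = p_AB / p_A      # definition (1.6.1)
print(p_A, p_AB, p_B_given_A)
```

Because the pairs are ordered, (5, 3) and (3, 5) are counted separately while (4, 4) is counted once, which is exactly the point made above.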


The extension of the definition of conditional probability to events with 
probability zero will be considered in great detail later on. For now, we are 
content to note some consequences of the above definition [whenever an 
expression such as P(B | A) is written, it is assumed that P(A) > 0]. 

Figure 1.6.1 Example on Conditional Probability. Numbers Indicate Favorable Outcomes. 

If A and B are independent, then $P(A \cap B) = P(A)P(B)$, so that 
$P(B \mid A) = P(B)$ and $P(A \mid B)$ $(= P(A \cap B)/P(B))$ $= P(A)$, in accordance 
with the intuitive notion that the occurrence of one of the events does not 
change the odds about the other. 

The formula $P(A \cap B) = P(A)P(B \mid A)$ may be extended to more than two 
events. 

$$P(A \cap B \cap C) = P(A \cap B)P(C \mid A \cap B)$$

Hence 

$$P(A \cap B \cap C) = P(A)P(B \mid A)P(C \mid A \cap B) \qquad (1.6.2)$$

Similarly 

$$P(A \cap B \cap C \cap D) = P(A)P(B \mid A)P(C \mid A \cap B)P(D \mid A \cap B \cap C) \qquad (1.6.3)$$

and so on. 


> Example 2. Three cards are drawn without replacement from an ordinary 
deck. Find the probability of not obtaining a heart. 
Let $A_i$ = {card $i$ is not a heart}. Then we are looking for 

$$P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) = \frac{39}{52}\cdot\frac{38}{51}\cdot\frac{37}{50}$$

[For example, to find $P(A_2 \mid A_1)$, we restrict ourselves to the outcomes favorable 
to $A_1$. If the first card is not a heart, 51 cards remain in the deck, including 
13 hearts, so that the probability of not getting a heart on the second 
trial is 38/51.] 

Notice that the above probability can be written $\binom{39}{3}/\binom{52}{3}$, which could have 
been derived by direct combinatorial reasoning. Furthermore, if the cards 
were drawn independently, with replacement, the probability would be quite 
different, $(3/4)^3 = 27/64$. < 
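
As a quick check of Example 2, the product of conditional probabilities from (1.6.2) can be compared with the combinatorial expression and with the with-replacement value. This is a sketch using exact rational arithmetic; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Product of conditional probabilities, formula (1.6.2).
sequential = Fraction(39, 52) * Fraction(38, 51) * Fraction(37, 50)

# Direct combinatorial count: choose 3 of the 39 nonhearts out of all 3-card selections.
combinatorial = Fraction(comb(39, 3), comb(52, 3))

# Independent draws with replacement, for comparison.
with_replacement = Fraction(3, 4) ** 3

print(sequential, combinatorial, with_replacement)
```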


We now prove one of the most useful theorems of the subject. 


Theorem of Total Probability. Let $B_1, B_2, \ldots$ be a finite or countably 
infinite family of mutually exclusive and exhaustive events (i.e., the $B_i$ are 
disjoint and their union is $\Omega$). If A is any event, then 

$$P(A) = \sum_i P(A \cap B_i) \qquad (1.6.4)$$

Thus P(A) is computed by finding a list of mutually exclusive, exhaustive 
ways in which A can happen, and then adding the individual probabilities. 
Also 

$$P(A) = \sum_i P(B_i)P(A \mid B_i) \qquad (1.6.5)$$

where the sum is taken over those $i$ for which $P(B_i) > 0$. Thus P(A) is a 
weighted average of the conditional probabilities $P(A \mid B_i)$. 


PROOF. 

$$P(A) = P(A \cap \Omega) = P\Bigl(A \cap \bigcup_i B_i\Bigr) = P\Bigl(\bigcup_i (A \cap B_i)\Bigr) = \sum_i P(A \cap B_i) = \sum_i P(B_i)P(A \mid B_i)$$


Notice that under the above assumptions we have 

$$P(B_k \mid A) = \frac{P(A \cap B_k)}{P(A)} = \frac{P(B_k)P(A \mid B_k)}{\sum_i P(B_i)P(A \mid B_i)} \qquad (1.6.6)$$


This formula is sometimes referred to as Bayes’ theorem; $P(B_k \mid A)$ is sometimes 
called an a posteriori probability. The reason for this terminology may 
be seen in the example below. 


> Example 3. Two coins are available, one unbiased and the other two-headed. 
Choose a coin at random and toss it once; assume that the unbiased 
coin is chosen with probability 3/4. Given that the result is heads, find the 
probability that the two-headed coin was chosen. 

The “tree diagram” shown in Figure 1.6.2 represents the experiment. 

Figure 1.6.2 Tree Diagram. [Branches: unbiased coin, then H with probability 1/2 or T with probability 1/2; two-headed coin, then H with probability 1 or T with probability 0.] 

We may take $\Omega$ to consist of the four possible paths through the tree, with 
each path assigned a probability equal to the product of the probabilities 
assigned to each branch. Notice that we are given the probabilities of the 
events $B_1$ = {unbiased coin chosen} and $B_2$ = {two-headed coin chosen}, as 
well as the conditional probabilities $P(A \mid B_i)$, where A = {coin comes up 
heads}. This is sufficient to determine the probabilities of all events. 

Now we can compute $P(B_2 \mid A)$ using Bayes’ theorem; this is facilitated if, 
instead of trying to identify the individual terms in (1.6.6), we simply look at 
the tree and write 

$$P(B_2 \mid A) = \frac{P(B_2 \cap A)}{P(A)} = \frac{P\{\text{two-headed coin chosen and coin comes up heads}\}}{P\{\text{coin comes up heads}\}} = \frac{(1/4)(1)}{(3/4)(1/2) + (1/4)(1)} = \frac{2}{5}$$
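
Formulas (1.6.5) and (1.6.6) translate directly into a short computation. The sketch below applies them to the two-coin experiment; the function name `posterior` and the list layout are illustrative choices, not notation from the text.

```python
from fractions import Fraction

def posterior(priors, likelihoods, k):
    """P(B_k | A) by Bayes' theorem (1.6.6).

    priors[i] = P(B_i), likelihoods[i] = P(A | B_i).
    """
    total = sum(p * l for p, l in zip(priors, likelihoods))  # P(A), by (1.6.5)
    return priors[k] * likelihoods[k] / total

# Index 0: unbiased coin, index 1: two-headed coin; A = {coin comes up heads}.
priors = [Fraction(3, 4), Fraction(1, 4)]
likelihoods = [Fraction(1, 2), Fraction(1)]

p_heads = sum(p * l for p, l in zip(priors, likelihoods))
p_two_headed_given_heads = posterior(priors, likelihoods, 1)
print(p_heads, p_two_headed_given_heads)
```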


There are many situations in which an experiment consists of a sequence 
of steps, and the conditional probabilities of events happening at step n + 1, 
given outcomes at step n, are specified. In such cases a description by means 
of a tree diagram may be very convenient (see Problems). 


> Example 4. A loaded die is tossed once; if N is the result of the toss, then 
$P\{N = i\} = p_i$, $i = 1, 2, 3, 4, 5, 6$. If $N = i$, an unbiased coin is tossed 
independently $i$ times. Find the conditional probability that N will be odd, 
given that at least one head is obtained (see Figure 1.6.3). 

Figure 1.6.3 

Let A = {at least one head obtained}, B = {N odd}. Then $P(B \mid A) = P(A \cap B)/P(A)$. Now 

$$P(A \cap B) = \sum_{i=1,3,5} P\{N = i \text{ and at least one head obtained}\} = \tfrac{1}{2}p_1 + \tfrac{7}{8}p_3 + \tfrac{31}{32}p_5$$

since when an unbiased coin is tossed independently $i$ times, the probability 
of at least one head is $1 - (1/2)^i$. Similarly, 

$$P(A) = \sum_{i=1}^{6} P\{N = i \text{ and at least one head obtained}\} = \sum_{i=1}^{6} p_i(1 - 2^{-i})$$

Thus 

$$P(B \mid A) = \frac{\tfrac{1}{2}p_1 + \tfrac{7}{8}p_3 + \tfrac{31}{32}p_5}{\tfrac{1}{2}p_1 + \tfrac{3}{4}p_2 + \tfrac{7}{8}p_3 + \tfrac{15}{16}p_4 + \tfrac{31}{32}p_5 + \tfrac{63}{64}p_6}$$ < 
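
The answer to Example 4 depends on the unspecified probabilities $p_1, \ldots, p_6$. The sketch below evaluates it for any given list; the uniform choice $p_i = 1/6$ used to exercise the function is an assumption made only for illustration (the die in the example is loaded).

```python
from fractions import Fraction

def prob_odd_given_head(p):
    """P{N odd | at least one head}, where p[i-1] = P{N = i}, i = 1..6."""
    # P{at least one head | N = i} = 1 - (1/2)^i
    head = [1 - Fraction(1, 2) ** i for i in range(1, 7)]
    joint = [p_i * h_i for p_i, h_i in zip(p, head)]   # P{N = i and at least one head}
    p_A = sum(joint)
    p_AB = sum(joint[i - 1] for i in (1, 3, 5))
    return p_AB / p_A

# Hypothetical input: a fair die, purely to exercise the formula.
uniform = [Fraction(1, 6)] * 6
result = prob_odd_given_head(uniform)
print(result)
```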


PROBLEMS 


1. In 10 Bernoulli trials find the conditional probability that all successes will 
occur consecutively (i.e., no two successes will be separated by one or more 
failures), given that the number of successes is between four and six. 


2. If X is the number of successes in m Bernoulli trials, find the probability that 
X > 3 given that X > 1. 


3. An unbiased die is tossed once. If the face is odd, an unbiased coin is tossed 
repeatedly; if the face is even, a biased coin with probability of heads $p \ne 1/2$ is 
tossed repeatedly. (Successive tosses of the coin are independent in each case.) 
If the first m throws result in heads, what is the probability that the unbiased 
coin is being used? 

4. A positive integer J is selected, with $P\{J = n\} = (1/2)^n$, $n = 1, 2, \ldots$. If J 
takes the value n, a coin with probability of heads $e^{-n}$ is tossed once. Find the 
probability that the resulting toss will be a head. 


5. A bridge player and his partner are known to have six spades between them. 
Find the probability that the spades will be split 
(a) 3-3 
(b) 4-2 or 2-4 
(c) 5-1 or 1-5 
(d) 6-0 or 0-6. 


6. An urn contains 30 white and 15 black balls. If 10 balls are drawn with 
(respectively without) replacement, find the probability that the first two balls will be 
white, given that the sample contains exactly six white balls. 


7. Let $C_1$ be an unbiased coin, and $C_2$ a biased coin with probability of heads 3/4. 
At time $t = 0$, $C_1$ is tossed. If the result is heads, then $C_1$ is tossed at time 
$t = 1$; if the result is tails, $C_2$ is tossed at $t = 1$. The process is repeated at 
$t = 2, 3, \ldots$. In general, if heads appears at $t = n$, then $C_1$ is tossed at 
$t = n + 1$; if tails appears at $t = n$, then $C_2$ is tossed at $t = n + 1$. 
Find $y_n$ = the probability that the toss at $t = n$ will be a head (set up a 
difference equation). 


8. In the switching network of Figure P.1.6.8, the switches operate independently. 
Each switch closes with probability p, and remains open with probability 1 - p. 

Figure P.1.6.8 

(a) Find the probability that a signal at the input will be received at the output. 

(b) Find the conditional probability that switch E is open given that a signal is 
received. 

9. In a certain village 20% of the population has disease D. A test is administered 
which has the property that if a person has D, the test will be positive 90% 
of the time, and if he does not have D, the test will still be positive 30% of the 
time. All those whose test is positive are given a drug which invariably cures the 
disease, but produces a characteristic rash 25% of the time. Given that a person 
picked at random has the rash, what is the probability that he actually had D 
to begin with? 


1.7 SOME FALLACIES IN COMBINATORIAL PROBLEMS 


In this section we illustrate some common traps occurring in combinatorial 
problems. In the first three examples there will be a multiple count. 


> Example 1. Three cards are selected from an ordinary deck, without 
replacement. Find the probability of not obtaining a heart. 


PROPOSED SOLUTION. The total number of selections is $\binom{52}{3}$. To find the 
number of favorable outcomes, notice that the first card cannot be a heart; 
thus we have 39 choices at step 1. Having removed one card, there are 38 
nonhearts left at step 2 (and then 37 at step 3). The desired probability is 
$(39)(38)(37)/\binom{52}{3}$. 


FALLACY. In computing the number of favorable outcomes, a particular 
selection might be: 9 of diamonds, 8 of clubs, 7 of diamonds. Another 
selection is: 8 of clubs, 9 of diamonds, 7 of diamonds. In fact the 3! = 6 
possible orderings of these three cards are counted separately in the numerator 
(but not in the denominator). Thus the proposed answer is too high by a 
factor of 3!; the actual probability is $(39)(38)(37)/[3!\binom{52}{3}] = \binom{39}{3}/\binom{52}{3}$ (see 
Example 2, Section 1.6). < 


> Example 2. Find the probability that a five-card poker hand will result 
in three of a kind (three cards of the same face value x, plus two cards of face 
values y and z, with x, y, and z distinct). 


PROPOSED SOLUTION. Pick the face value to appear three times (13 
possibilities). Pick three suits out of four for the “three of a kind” ($\binom{4}{3}$ 
choices). Now one face value is excluded, so that 48 cards are left in the deck. 
Pick one of them as the fourth card; the fifth card can be chosen in 44 ways, 
since the fourth card excludes another face value. Thus the desired probability 
is $(13)\binom{4}{3}(48)(44)/\binom{52}{5}$. 


FALLACY. Say the first three cards are aces. The fourth and fifth cards 
might be the jack of clubs and the 6 of diamonds, or equally well the 6 of 
diamonds and the jack of clubs. These possibilities are counted separately in 
the numerator but not in the denominator, so that the proposed answer is 
too high by a factor of 2. The actual probability is $13\binom{4}{3}(48)(44)/[2\binom{52}{5}] = 
13\binom{4}{3}\binom{12}{2}16/\binom{52}{5}$ [see Problem 2, Section 1.4; the factor $\binom{12}{2}16$ corresponds to 
the selection of two distinct face values out of the remaining 12, then one 
card from each of these face values]. 


REMARK. A more complicated approach to this problem is as follows. 
Pick the face value x to appear three times, then pick three suits out 
of four, as before. Forty-nine cards remain in the deck, and the total 
number of ways of selecting two remaining cards is $\binom{49}{2}$. However, if 
the two face values are the same, we obtain a full house; there are 
$12\binom{4}{2}$ selections in which this happens (select one face value out of 12, 
then two suits out of four). Also, if one of the two cards has face value 
x, we obtain four of a kind; since there is only one remaining card 
with face value x and 48 cards remain after this one is chosen, there 
are 48 possibilities. Thus the probability of obtaining three of a kind 
is 

$$\frac{13\binom{4}{3}\left[\binom{49}{2} - 12\binom{4}{2} - 48\right]}{\binom{52}{5}}$$

(This agrees with the previous answer.) < 
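
The three counts discussed in Example 2 can be compared numerically. This is a small sketch; the variable names are illustrative.

```python
from math import comb

# Proposed (fallacious) count: the fourth and fifth cards are treated as ordered.
proposed = 13 * comb(4, 3) * 48 * 44

# Corrected count: divide by 2 for the two orderings of the last two cards.
corrected = proposed // 2

# Count via two distinct face values out of the remaining 12, one card from each.
by_face_values = 13 * comb(4, 3) * comb(12, 2) * 16

# Count from the Remark: all pairs from 49 cards, minus full houses, minus four of a kind.
by_subtraction = 13 * comb(4, 3) * (comb(49, 2) - 12 * comb(4, 2) - 48)

probability = corrected / comb(52, 5)
print(proposed, corrected, by_face_values, by_subtraction, probability)
```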


> Example 3. Ten cards are drawn without replacement from an ordinary 
deck. Find the probability that at least nine will be of the same suit. 


PROPOSED SOLUTION. Pick the suit in any one of four ways, then choose 
nine of 13 face values. Forty-three cards now remain in the deck, so that the 
desired probability is $4\binom{13}{9}43/\binom{52}{10}$. 


FALLACY. Consider two possible selections. 

1. Spades are chosen, then face values A K Q J 10 9 8 7 6. The last card 
is the 5 of spades. 

2. Spades are chosen, then face values A K Q J 10 9 8 7 5. The last card 
is the 6 of spades (see Figure 1.7.1). 

Both selections yield the same 10 cards, 
but are counted separately in the computation. To find the number of 
duplications, notice that we can select 10 cards out of 13 to be involved in the 
duplication; each choice of one card (out of 10) for the last card yields a 
distinct path in Figure 1.7.1. Of the 10 possible paths corresponding to a given 
selection of cards, nine are redundant. Thus the actual probability is 

$$\frac{4\left[\binom{13}{9}43 - 9\binom{13}{10}\right]}{\binom{52}{10}}$$

Figure 1.7.1 Multiple Count. [Two paths: face values A K Q J 10 9 8 7 6 followed by the 5 of spades, and face values A K Q J 10 9 8 7 5 followed by the 6 of spades.] 

Now $\binom{13}{9}4 = \binom{13}{10}10$, so that 

$$\binom{13}{9}43 - 9\binom{13}{10} = \binom{13}{9}39 + \binom{13}{10}$$

and the probability is 

$$\frac{4\left[\binom{13}{9}39 + \binom{13}{10}\right]}{\binom{52}{10}}$$

as obtained in a straightforward manner in Problem 4, Section 1.4. < 
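
The multiple count in Example 3 can be made concrete by comparing the proposed count with the corrected one and with a direct count (exactly nine of a suit plus one card of another suit, or all ten of one suit). A sketch with illustrative names:

```python
from math import comb

# Proposed count: each 10-of-a-suit hand is reached along 10 different paths.
proposed = 4 * comb(13, 9) * 43

# Remove the 9 redundant paths for each selection of 10 cards of one suit.
corrected = 4 * (comb(13, 9) * 43 - 9 * comb(13, 10))

# Direct count: exactly nine of the suit and one of the other 39 cards, or ten of the suit.
direct = 4 * (comb(13, 9) * 39 + comb(13, 10))

probability = direct / comb(52, 10)
print(proposed, corrected, direct, probability)
```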


> Example 4. An urn contains 10 balls $b_1, \ldots, b_{10}$. Five balls are drawn 
without replacement. Find the probability that $b_8$ and $b_9$ will be included in 
the sample. 


PROPOSED SOLUTION. We are drawing half the balls, so that the probability 
that a particular ball will be included is 1/2. Thus the probability of 
including both $b_8$ and $b_9$ is (1/2)(1/2) = 1/4. 


FALLACY. Let A = {$b_8$ is included}, B = {$b_9$ is included}. The difficulty 
is simply that A and B are not independent. For $P(A \cap B) = \binom{8}{3}/\binom{10}{5} = 2/9$ 
(after $b_8$ and $b_9$ are chosen, three balls are to be selected from the remaining 
eight). Also $P(A) = P(B) = \binom{9}{4}/\binom{10}{5} = 1/2$, so that $P(A \cap B) \ne P(A)P(B)$. < 
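
The lack of independence in Example 4 can be verified by enumerating all unordered samples of five balls. In this sketch the balls are represented by the integers 1 through 10, which is an illustrative labeling.

```python
from fractions import Fraction
from itertools import combinations

# All unordered samples of 5 balls out of 10, each equally likely.
samples = list(combinations(range(1, 11), 5))

def prob(event):
    return Fraction(sum(1 for s in samples if event(s)), len(samples))

p_A = prob(lambda s: 8 in s)                # b_8 is included
p_B = prob(lambda s: 9 in s)                # b_9 is included
p_AB = prob(lambda s: 8 in s and 9 in s)    # both are included
print(p_A, p_B, p_AB, p_A * p_B)
```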


> Example 5. Two cards are drawn independently, with replacement, from 
an ordinary deck; at each selection all 52 cards are equally likely. Find the 
probability that the king of spades and the king of hearts will be chosen (in 
some order). 


PROPOSED SOLUTION. The number of unordered samples of size 2 out of 
52, with replacement, is $\binom{52+2-1}{2} = \binom{53}{2}$ [see (1.4.4)]. The kings of spades 
and hearts constitute one such sample, so that the desired probability is 
$1/\binom{53}{2}$. 

FALLACY. It is not legitimate to assign equal probability to all unordered 
samples with replacement. If we do this we are saying, for example, that the 
outcomes “ace of spades, ace of spades” and “king of spades, king of hearts” 
have the same probability. However, this cannot be the case if independent 
sampling is assumed. For the probability that the ace of spades is chosen twice 
is $(1/52)^2$, while the probability that the spade and heart kings will be chosen 
(in some order) is P{first card is the king of spades, second card is the king 
of hearts} + P{first card is the king of hearts, second card is the king of 
spades} = $2(1/52)^2$, which is the desired probability. 

The main point is that we must use ordered samples with replacement in 
order to capture the idea of independence. < 
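
The point of Example 5 can be seen by enumerating the ordered samples with replacement. In this sketch the cards are numbered 0 to 51, and the labels `KING_SPADES` and `KING_HEARTS` are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import product
from math import comb

deck = range(52)
KING_SPADES, KING_HEARTS = 0, 1   # arbitrary labels for two distinct cards

# Ordered samples of size 2 with replacement: 52 * 52 equally likely outcomes.
ordered = list(product(deck, repeat=2))
favorable = sum(1 for a, b in ordered if {a, b} == {KING_SPADES, KING_HEARTS})
correct = Fraction(favorable, len(ordered))

# Value obtained by (incorrectly) treating unordered samples as equally likely.
fallacious = Fraction(1, comb(53, 2))
print(correct, fallacious)
```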


1.8 APPENDIX: STIRLING’S FORMULA 


An estimate of n! that is of importance both in numerical calculations and 
theoretical analysis is Stirling’s formula 

$$n! \sim n^n e^{-n}\sqrt{2\pi n}$$

in the sense that 

$$\lim_{n \to \infty} \frac{n!}{n^n e^{-n}\sqrt{2\pi n}} = 1$$
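
A numerical look at the ratio in Stirling's formula may be helpful before the proof. The sketch below works with logarithms (via `math.lgamma`, using the fact that `lgamma(n + 1)` equals ln n!) to avoid overflow; the function name `stirling_ratio` is illustrative.

```python
import math

def stirling_ratio(n):
    """n! / (n^n e^{-n} sqrt(2 pi n)), computed through logarithms."""
    log_factorial = math.lgamma(n + 1)
    log_stirling = n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)
    return math.exp(log_factorial - log_stirling)

ratios = [stirling_ratio(n) for n in (1, 10, 100, 1000)]
print(ratios)
```

The computed ratios decrease toward 1 as n grows, in line with the monotonicity established in part (c) of the proof.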
PROOF. Define $(2n)!!$ (read 2n semifactorial) as $2n(2n - 2)(2n - 4)\cdots 6(4)(2)$, 
and $(2n + 1)!!$ as $(2n + 1)(2n - 1)\cdots(5)(3)(1)$. We first show that 

(a) 

$$\frac{(2n)!!}{(2n+1)!!} < \frac{\pi}{2}\,\frac{(2n-1)!!}{(2n)!!} < \frac{(2n-2)!!}{(2n-1)!!}$$

Let $I_k = \int_0^{\pi/2} (\cos x)^k\,dx$, $k = 0, 1, 2, \ldots$. Then $I_0 = \pi/2$, $I_1 = 1$. Integrating 
by parts, we obtain $I_k = \int_0^{\pi/2} (\cos x)^{k-1}\,d(\sin x) = \int_0^{\pi/2} (k - 1)(\cos x)^{k-2} \sin^2 x\,dx$. 
Since $\sin^2 x = 1 - \cos^2 x$, we have $I_k = (k - 1)I_{k-2} - (k - 1)I_k$, 
or $I_k = [(k - 1)/k]I_{k-2}$. By iteration, we obtain $I_{2n} = (\pi/2)[(2n - 1)!!/(2n)!!]$ 
and $I_{2n+1} = [(2n)!!/(2n + 1)!!]$. Since $(\cos x)^k$ decreases with k, 
so does $I_k$, and hence $I_{2n+1} < I_{2n} < I_{2n-1}$, and (a) is proved. 


(b) Let $Q_n = \binom{2n}{n}/2^{2n}$. Then 

$$\lim_{n \to \infty} Q_n\sqrt{n\pi} = 1$$

To prove this, write 

$$Q_n = \frac{(2n)!}{n!\,n!\,2^{2n}} = \frac{(2n)!}{(2^n n!)^2} = \frac{(2n)!}{((2n)(2n - 2)\cdots(4)(2))^2} = \frac{(2n-1)!!}{(2n)!!}$$

Thus, by (a), 

$$\frac{(2n)!!}{(2n+1)!!} < \frac{\pi}{2}\,Q_n < \frac{(2n-2)!!}{(2n-1)!!}$$

Multiply this inequality by 

$$\frac{(2n-1)!!}{(2n-2)!!} = \frac{(2n-1)!!}{(2n)!!}\cdot\frac{(2n)!!}{(2n-2)!!} = 2nQ_n$$

to obtain 

$$\frac{2n}{2n+1} < n\pi Q_n^2 < 1$$

If we let $n \to \infty$, we obtain $n\pi Q_n^2 \to 1$, proving (b). 


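(c) Proof of Stirling’s formula. Let $c_n = n!/(n^n e^{-n}\sqrt{2\pi n})$. We must show 
that $c_n \to 1$ as $n \to \infty$. Consider $(n + 1)!/n! = n + 1$. We have 

$$\frac{(n+1)!}{n!} = \frac{c_{n+1}(n+1)^{n+1}e^{-(n+1)}\sqrt{2\pi(n+1)}}{c_n n^n e^{-n}\sqrt{2\pi n}} = \frac{c_{n+1}}{c_n}\,\frac{1}{e}\left(\frac{n+1}{n}\right)^{n}\frac{(n+1)^{3/2}}{\sqrt{n}}$$

Thus 

$$\frac{c_{n+1}}{c_n} = (n+1)\,e\left(\frac{n}{n+1}\right)^{n}\frac{\sqrt{n}}{(n+1)^{3/2}} = \frac{e}{(1 + 1/n)^{n+1/2}}$$

Now $(1 + 1/n)^{n+1/2} > e$ for n sufficiently large (take logarithms and expand 
in a power series); hence $c_{n+1}/c_n < 1$ for large enough n. Since every monotone 
bounded sequence converges, $c_n \to$ a limit c. We must show c = 1. 

By (b), 

$$\lim_{n \to \infty} Q_n\sqrt{n\pi} = 1$$

But 

$$Q_n = \frac{(2n)!}{n!\,n!\,2^{2n}} = \frac{c_{2n}(2n/e)^{2n}\sqrt{4\pi n}}{(c_n(n/e)^n\sqrt{2\pi n})^2\,2^{2n}} = \frac{c_{2n}}{c_n^2\sqrt{n\pi}}$$

Therefore $c_{2n}/c_n^2 \to 1$. However, $c_{2n} \to c$ and $c_n^2 \to c^2$, and consequently 
$c/c^2 = 1$, so that c = 1. The theorem is proved. 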


REMARK. The last step requires that c be > 0. To see this, write 

$$c_{n+1} = \frac{c_1}{c_0}\cdot\frac{c_2}{c_1}\cdots\frac{c_{n+1}}{c_n}$$

where $c_0$ is defined as 1. To show that $c_n \to$ a nonzero limit, it suffices 
to show that the limit of $\ln c_{n+1}$ is finite, and for this it is sufficient to 
show that $\sum_n \ln(c_{n+1}/c_n)$ converges to a finite limit. Now 

$$\ln\frac{c_{n+1}}{c_n} = \ln\left[e\left(1 + \frac{1}{n}\right)^{-(n+1/2)}\right] = 1 - \left(n + \tfrac{1}{2}\right)\ln\left(1 + \frac{1}{n}\right) = 1 - \left(n + \tfrac{1}{2}\right)\left(\frac{1}{n} - \frac{1}{2n^2} + \frac{\theta(n)}{n^3}\right)$$

where $\theta(n)$ is bounded by a constant independent of n. This is of the order 
of $1/n^2$; hence $\sum_n \ln(c_{n+1}/c_n)$ converges, and the result follows. 


Random Variables 


2.1 INTRODUCTION 


In Chapter 1 we mentioned that there are situations in which not all subsets 
of the sample space $\Omega$ can belong to the event class $\mathcal{F}$, and that difficulties 
of this type generally arise when $\Omega$ is uncountable. Such spaces may arise 
physically as approximations to discrete spaces with a very large number of 
points. For example, if a person is picked at random in the United States and 
his age recorded, a complete description of this experiment would involve a 
probability space with approximately 200 million points (if the data are 
recorded accurately enough, no two people have the same age). A more 
convenient way to describe the experiment is to group the data, for example, 
into 10-year intervals. We may define a function $g(x)$, $x = 5, 15, 25, \ldots$, 
so that $g(x)$ is the number of people, say in millions, between $x - 5$ and 
$x + 5$ years (see Figure 2.1.1). 

For example, if g(15) = 40, there are 40 million people between the ages of 
10 and 20 or, on the average, 4 million per year over that 10-year span. Now 
if we want the probability that a person picked at random will be between 14 
and 16, we can get a reasonable figure by taking the average number of 
people per year [4 = g(15)/10] and multiplying by the number of years (2) 
to obtain (roughly) 8 million people, then dividing by the total population to 
obtain a probability of 8/200 = .04. 


Figure 2.1.1 Age Statistics. 


If we connect the values of $g(x)$ by a smooth curve, essentially what we are 
doing is evaluating $(1/200)\int_{14}^{16} [g(x)/10]\,dx$ to find the probability that a 
person picked at random will be between 14 and 16 years old. In general, 
we estimate the number of people between ages a and b by $\int_a^b [g(x)/10]\,dx$ 
so that $g(x)/10$ is the age density, that is, the number of people per unit age. 
We estimate the probability of obtaining an age between a and b by 
$\int_a^b [g(x)/2000]\,dx$; thus $g(x)/2000$ is the probability density, or probability 
per unit age. Thus we are led to the idea of assigning probabilities by means 
of an integral. We are taking $\Omega$ as (a subset of) the reals, and assigning 
$P(B) = \int_B f(x)\,dx$, where f is a real-valued function defined on the reals. 
There are several immediate questions, namely, what sigma field we are 
using, what functions f are allowed, what we mean by $\int_B f(x)\,dx$, and how 
we know that the resulting P is a probability. 
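
The idea of assigning probabilities by an integral can be tried out numerically. The sketch below uses a midpoint Riemann sum; the constant value g(x) = 40 on the range of interest is an assumption taken from the text's illustration g(15) = 40, and the helper name `riemann_integral` is illustrative.

```python
def riemann_integral(f, a, b, steps=1000):
    """Midpoint Riemann sum of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

def g(x):
    # Assumed for illustration: 40 million people in the 10-year group around age 15.
    return 40.0

def density(x):
    # Probability per unit age: g(x)/10 people (in millions) per year, out of 200 million.
    return g(x) / 2000

p_14_to_16 = riemann_integral(density, 14, 16)
print(p_14_to_16)
```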

For the moment suppose that we restrict ourselves to continuous or 
piecewise continuous f. Then we can certainly talk about $\int_B f(x)\,dx$, at 
least when B is an interval, and the integral is in the Riemann sense. Thus 
the appropriate sigma field $\mathcal{F}$ should contain the intervals, and hence must 
be at least as big as the smallest sigma field $\mathcal{B}$ containing the intervals ($\mathcal{B}$ 
exists; it can be described as the intersection of all sigma fields containing 
the intervals). The sigma field $\mathcal{B} = \mathcal{B}(E^1)$ is called the class of Borel sets 
of the reals $E^1$. Intuitively we may think of $\mathcal{B}$ being generated by starting 
with the intervals and repeatedly forming new sets by taking countable 
unions (and countable intersections) and complements in all possible ways 
(it turns out that there are subsets of $E^1$ that are not Borel sets). 

Thus our problem will be to construct probability measures on the class of 
Borel sets of $E^1$. The reason for considering only the Borel sets rather than 
all subsets of $E^1$ is this. Suppose that we require that $P(B) = \int_B f(x)\,dx$ 
for all intervals B, where f is a particular nonnegative continuous function 
defined on $E^1$, and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. There is no probability measure on the 
class of all subsets of $E^1$ satisfying this requirement, but there is such a measure 
on the Borel sets. 

Before elaborating on these ideas, it is convenient to introduce the concept 
of a random variable; we do this in the next section. 


2.2 DEFINITION OF A RANDOM VARIABLE 


Intuitively, a random variable is a quantity that is measured in connection 
with a random experiment. If $\Omega$ is a sample space, and the outcome of the 
experiment is $\omega$, a measuring process is carried out to obtain a number $R(\omega)$. 
Thus a random variable is a real-valued function on a sample space. (The 
formal definition, which is postponed until later in the section, is somewhat 
more restrictive.) 


> Example 1. Throw a coin 10 times, and let R be the number of heads. 
We take $\Omega$ = all sequences of length 10 with components H and T; $2^{10}$ 
points altogether. A typical sample point is $\omega$ = HHTHTTHHTH. For 
this point $R(\omega) = 6$. Another random variable, $R_1$, is the number of times 
a head is followed immediately by a tail. For the point $\omega$ above, $R_1(\omega) = 3$. < 
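
Since a random variable is simply a function on the sample space, the two random variables of Example 1 can be written as ordinary functions of a sample point. A minimal sketch, with sample points represented as strings of H and T (a representation chosen here for convenience):

```python
def R(w):
    """Number of heads in the sequence w."""
    return w.count("H")

def R1(w):
    """Number of times a head is followed immediately by a tail."""
    return sum(1 for a, b in zip(w, w[1:]) if a == "H" and b == "T")

omega = "HHTHTTHHTH"   # a sample point of length 10
print(R(omega), R1(omega))
```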


> Example 2. Pick a person at random from a certain population and 
measure his height and weight. We may take the sample space to be the 
plane $E^2$, that is, the set of all pairs $(x, y)$ of real numbers, with the first 
coordinate x representing the height and the second coordinate y the weight 
(we can take care of the requirement that height and weight be nonnegative 
by assigning probability 0 to the complement of the first quadrant). Let 
$R_1$ be the height of the person selected, and let $R_2$ be the weight. Then 
$R_1(x, y) = x$, $R_2(x, y) = y$. As another example, let $R_3$ be twice the height 
plus the cube root of the weight; that is, $R_3 = 2R_1 + \sqrt[3]{R_2}$. Then 
$R_3(x, y) = 2R_1(x, y) + \sqrt[3]{R_2(x, y)} = 2x + \sqrt[3]{y}$. < 


> Example 3. Throw two dice. We may take the sample space to be the 
set of all pairs of integers $(x, y)$, $x, y = 1, 2, \ldots, 6$ (36 points in all). 

Let $R_1$ = the result of the first toss. Then $R_1(x, y) = x$. 

Let $R_2$ = the sum of the two faces. Then $R_2(x, y) = x + y$. 

Let $R_3 = 1$ if at least one face is an even number; $R_3 = 0$ otherwise. 

Then $R_3(6, 5) = 1$; $R_3(3, 6) = 1$; $R_3(1, 3) = 0$, and so on. < 


> Example 4. Imagine that we can observe the times at which electrons are 
emitted from the cathode of a vacuum tube, starting at time $t = 0$. As a 
sample space, we may take all infinite sequences of positive real numbers, 
with the components representing the emission times. Assume that the 
emission process never stops. Typical sample points might be $\omega_1 = (.2, 1.5, 6.3, \ldots)$, 
$\omega_2 = (.01, .5, .9, 1.7, \ldots)$. If $R_1$ is the number of electrons 
emitted before $t = 1$, then $R_1(\omega_1) = 1$, $R_1(\omega_2) = 3$. If $R_2$ is the time at which 
the first electron is emitted, then $R_2(\omega_1) = .2$, $R_2(\omega_2) = .01$. < 


If we are interested in a random variable R defined on a given sample space, 
we generally want to know the probability of events involving R. Physical 
measurements of a quantity R generally lead to statements of the form 
$a < R < b$, and it is natural to ask for the probability that R will lie between 
a and b in a given performance of the experiment. Thus we are looking for 
$P\{\omega: a < R(\omega) < b\}$ (or, equally well, $P\{\omega: a \le R(\omega) \le b\}$, and so on). 
For example, if a coin is tossed independently n times, with probability p 
of coming up heads on a given toss, and if R is the number of heads, we 
have seen in Chapter 1 that 

$$P\{\omega: a \le R(\omega) \le b\} = \sum_{k=a}^{b} \binom{n}{k} p^k (1 - p)^{n-k}$$

NOTATION. $\{\omega: a < R(\omega) < b\}$ will often be abbreviated to $\{a < R < b\}$. 

As another example, if two unbiased dice are tossed independently, and 
$R_2$ is the sum of the faces (Example 3 above), then $P\{R_2 = 6\}$ = P{(5, 1), 
(1, 5), (4, 2), (2, 4), (3, 3)} = 5/36. 
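
The random variables of Example 3 and the probability just computed can be reproduced by enumeration. This is a sketch assuming two unbiased dice tossed independently, so that each of the 36 pairs has probability 1/36; the helper `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 sample points (x, y)

def R1(w):
    return w[0]                                 # result of the first toss

def R2(w):
    return w[0] + w[1]                          # sum of the two faces

def R3(w):
    return 1 if (w[0] % 2 == 0 or w[1] % 2 == 0) else 0   # at least one even face

def prob(event):
    """P{w : event(w)} when all 36 points are equally likely."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

p_sum_is_6 = prob(lambda w: R2(w) == 6)
print(p_sum_is_6)
```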

In general an “event involving R” corresponds to a statement that the value 
of R lies in a set B; that is, the event is of the form $\{\omega: R(\omega) \in B\}$. Intuitively, 
if $P\{\omega: R(\omega) \in I\}$ is known for all intervals I, then $P\{\omega: R(\omega) \in B\}$ is determined 
for any “well-behaved” set B, the reason being that any such set can 
be built up from intervals. For example, $P\{0 \le R < 2 \text{ or } R > 3\}$ $(= 
P\{R \in [0, 2) \cup (3, \infty)\})$ $= P\{0 \le R < 2\} + P\{R > 3\}$. Thus it appears that 
in order to describe the nature of R completely, it is sufficient to know 
$P\{R \in I\}$ for each interval I. We consider in more detail the problem of 
characterizing a random variable in the next section; in the remainder of 
this section we give the formal definition of a random variable. 


* For the concept of random variable to fit in with our established model 
for a probability space, the sets $\{a \le R \le b\}$ must be events; that is, they 
must belong to the sigma field $\mathcal{F}$. Thus a first restriction on R is that for all 
real a, b, the sets $\{\omega: a \le R(\omega) \le b\}$ are in $\mathcal{F}$. Thus we can talk intelligently 
about the event that R lies between a and b. 

A question now comes up: Suppose that the sets $\{a \le R \le b\}$ are in $\mathcal{F}$ 
for all a, b. Can we talk about the event that R belongs to a set B of reals, 
for B more general than a closed interval? 

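For example, let B = [a, b) be an interval closed on the left, open on the 
right. Then 

$$a \le R(\omega) < b \quad \text{iff} \quad a \le R(\omega) \le b - \frac{1}{n} \ \text{for at least one } n = 1, 2, \ldots$$

Thus 

$$\{\omega: a \le R(\omega) < b\} = \bigcup_{n=1}^{\infty} \left\{\omega: a \le R(\omega) \le b - \frac{1}{n}\right\}$$

and this set is a countable union of sets in $\mathcal{F}$, hence belongs to $\mathcal{F}$. In a similar 
fashion we can handle all types of intervals. Thus $\{\omega: R(\omega) \in B\} \in \mathcal{F}$ for 
all intervals B. 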

In fact $\{\omega: R(\omega) \in B\}$ belongs to $\mathcal{F}$ for all Borel sets B. The sequence of 
steps by which this is proved is outlined in Problem 1. 

We are now ready for the formal definition. 


DEFINITION. A random variable on the probability space $(\Omega, \mathcal{F}, P)$ is a real-valued 
function R defined on $\Omega$, such that for every Borel subset B 
of the reals, $\{\omega: R(\omega) \in B\}$ belongs to $\mathcal{F}$. 


Notice that the probability P is not involved in the definition at all; if 
R is a random variable on $(\Omega, \mathcal{F}, P)$ and the probability measure is changed, 
R is still a random variable. Notice also that, by the above discussion, to 
check whether a given function R is a random variable it is sufficient to know 
that $\{\omega: a \le R(\omega) \le b\} \in \mathcal{F}$ for all real a, b. In fact (Problem 2) it is sufficient 
that $\{\omega: R(\omega) < b\} \in \mathcal{F}$ for all real b (or, equally well, $\{\omega: R(\omega) \le b\} \in \mathcal{F}$ 
for all real b; or $\{\omega: R(\omega) > a\} \in \mathcal{F}$ for all real a; or $\{\omega: R(\omega) \ge a\} \in \mathcal{F}$ 
for all real a; the argument is essentially the same in all cases). 

Notice that if $\mathcal{F}$ consists of all subsets of $\Omega$, $\{\omega: R(\omega) \in B\}$ automatically 
belongs to $\mathcal{F}$, so that in this case any real-valued function on the sample 
space is a random variable. Examples 1 and 3 fall into this category. 

Now let us consider Example 2. We take $\Omega$ = the plane $E^2$, $\mathcal{F}$ = the class 
of Borel subsets of $E^2$, that is, the smallest sigma field containing all rectangles 
(we shall use “rectangle” in a very broad sense, allowing open, 
closed, or semiclosed rectangles, as well as infinite rectangular strips). 

To check that $R_1$ is a random variable, we have 

$$\{(x, y): a \le R_1(x, y) \le b\} = \{(x, y): a \le x \le b\}$$

which is a rectangular strip and hence a set in $\mathcal{F}$. Similarly, $R_2$ is a random 
variable. For $R_3$, see Problem 3. 
Example 2 generalizes as follows. Take $\Omega = E^n$ = all n-tuples of real 
numbers, $\mathcal{F}$ the smallest sigma field containing the n-dimensional “intervals.” 
[If $a = (a_1, \ldots, a_n)$, $b = (b_1, \ldots, b_n)$, the interval (a, b) is defined as 
$\{x \in E^n: a_i < x_i < b_i,\ i = 1, \ldots, n\}$; closed and semiclosed intervals are 
defined similarly.] The coordinate functions, given by $R_1(x_1, \ldots, x_n) = x_1$, 
$R_2(x_1, \ldots, x_n) = x_2$, ..., $R_n(x_1, \ldots, x_n) = x_n$, are random variables. 

Example 4 involves some serious complications, since the sample points are 
infinite sequences of real numbers. We postpone the discussion of situations 
of this type until much later (Chapter 6). 


PROBLEMS 


*1. Let R be a real-valued function on a sample space $\Omega$, and let $\mathcal{C}$ be the collection 
of all subsets B of $E^1$ such that $\{\omega: R(\omega) \in B\} \in \mathcal{F}$. 

(a) Show that $\mathcal{C}$ is a sigma field. 

(b) If all intervals belong to $\mathcal{C}$, that is, if $\{\omega: R(\omega) \in B\} \in \mathcal{F}$ when B is an 
interval, show that all Borel sets belong to $\mathcal{C}$. Conclude that R is a random 
variable. 

*2. Let R be a real-valued function on a sample space $\Omega$, and assume 
$\{\omega: R(\omega) < b\} \in \mathcal{F}$ for all real b. Show that R is a random variable. 

*3. In Example 2, show that $R_3$ is a random variable. Do this by showing that if 
$R_1$ and $R_2$ are random variables, so is $R_1 + R_2$; if R is a random variable, 
so is aR for any real a; if R is a random variable, so is $\sqrt[3]{R}$. 


2.3 CLASSIFICATION OF RANDOM VARIABLES 


If $R$ is a random variable on the probability space $(\Omega, \mathcal{F}, P)$, we are
generally interested in calculating probabilities of events involving $R$, that is,
$P\{\omega: R(\omega) \in B\}$ for various (Borel) sets $B$. The way in which these
probabilities are calculated will depend on the particular nature of $R$; in this
section we examine some standard classes of random variables.

The random variable $R$ is said to be discrete iff the set of possible values
of $R$ is finite or countably infinite. In this case, if $x_1, x_2, \ldots$ are the values of
$R$ that belong to $B$, then

$$P\{R \in B\} = P\{R = x_1 \text{ or } R = x_2 \text{ or } \cdots\}
= P\{R = x_1\} + P\{R = x_2\} + \cdots = \sum_{x \in B} p_R(x)$$

where $p_R(x)$, $x$ real, is the probability function of $R$, defined by $p_R(x) =
P\{R = x\}$. Thus the probability of an event involving $R$ is found by summing
the probability function over the set of points favorable to the event. In
particular, the probability function determines the probability of all events
involving $R$.


▶ Example 1. Let $R$ be the number of heads in two independent tosses of a
coin, with the probability of heads being .6 on a given toss. Take $\Omega$ =
{HH, HT, TH, TT} with probabilities .36, .24, .24, .16 assigned to the four
points of $\Omega$; take $\mathcal{F}$ = all subsets. Then $R$ has three possible values, namely,
0, 1, and 2, and $P\{R = 0\} = .16$, $P\{R = 1\} = .48$, $P\{R = 2\} = .36$, by
inspection or by using the binomial formula

$$P\{R = k\} = \binom{n}{k} p^k (1 - p)^{n-k}$$ ◀
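
The numbers in Example 1 can be checked with a short Python sketch. It is only an illustration: the helper names `binomial_pmf` and `prob_event` are hypothetical and do not come from the text. The sketch evaluates the binomial formula for n = 2, p = .6 and then finds the probability of an event by summing the probability function over the favorable points.

```python
from math import comb


def binomial_pmf(n, p):
    """Return {k: P{R = k}} for the number of heads in n independent tosses."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def prob_event(pmf, event):
    """P{R in B}: sum the probability function over the points favorable to B."""
    return sum(prob for value, prob in pmf.items() if event(value))


pmf = binomial_pmf(2, 0.6)                      # Example 1: two tosses, P(heads) = .6
p_at_least_one = prob_event(pmf, lambda k: k >= 1)

print({k: round(v, 2) for k, v in pmf.items()})
print(round(p_at_least_one, 2))
```

The event {R >= 1} is an arbitrary choice; any predicate on the values of R can be passed in the same way.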


Another way of characterizing $R$ is by means of the distribution function,
defined by

$$F_R(x) = P\{R \le x\}, \quad x \text{ real}$$

(see Figure 2.3.1 for a sketch of $F_R$ and $p_R$ in Example 1).

Observe that, for example, $P\{R \le 1\} = p_R(0) + p_R(1) = .64$, but if
$0 \le x < 1$, we have $P\{R \le x\} = p_R(0) = .16$. Thus $F_R$ has a discontinuity
at $x = 1$, of magnitude $.48 = p_R(1)$. In general, if $R$ is discrete, and
$P\{R = x_n\} = p_n$, $n = 1, 2, \ldots$, where the $p_n$ are $> 0$ and $\sum_n p_n = 1$, then
$F_R$ has a jump of magnitude $p_n$ at $x = x_n$; $F_R$ is constant between jumps.

In the discrete case, if we are given the probability function, we can
construct the distribution function, and, conversely, given $F_R$, we can construct
$p_R$. Knowledge of either function is sufficient to determine the probability
of all events involving $R$.

FIGURE 2.3.1 Distribution and Probability Functions of a Discrete Random Variable.
[$F_R(x)$ has jumps of .16, .48, and .36 at $x$ = 0, 1, and 2; $p_R(x)$ takes the
values .16, .48, and .36 at those points.]

We now consider the case introduced in Section 2.1, where probabilities 
are assigned by means of an integral. 

Let $f$ be a nonnegative Riemann integrable† function defined on $E^1$ with
$\int_{-\infty}^{\infty} f(x)\,dx = 1$. Take $\Omega = E^1$, $\mathcal{F}$ = Borel sets. We would like to write,
for each $B \in \mathcal{F}$,

$$P(B) = \int_B f(x)\,dx$$

but this makes sense only if $B$ is an interval. However, the following result is
applicable.


Theorem 1. Let $f$ be a nonnegative real-valued function on $E^1$, with
$\int_{-\infty}^{\infty} f(x)\,dx = 1$. There is a unique probability measure $P$ defined on the Borel
subsets of $E^1$, such that $P(B) = \int_B f(x)\,dx$ for all intervals $B = (a, b]$.


The theorem belongs to the domain of measure and integration theory, and 
will not be proved here. 

The theorem allows us to talk about the integral of $f$ over an arbitrary
Borel set $B$. We simply define $\int_B f(x)\,dx$ as $P(B)$, where $P$ is the probability
measure given by the theorem.

The uniqueness part of the theorem may then be phrased as follows. If
$Q$ is a probability measure on the Borel subsets of $E^1$ and $Q(B) = \int_B f(x)\,dx$
for all intervals $B = (a, b]$, then $Q(B) = \int_B f(x)\,dx$ for all Borel sets $B$.

If $R$ is defined on $\Omega$ by $R(\omega) = \omega$ (so that the outcome of the experiment is
identified with the value of $R$), then

$$P\{\omega: R(\omega) \in B\} = P(B) = \int_B f(x)\,dx$$

In particular, the distribution function of $R$ is given by

$$F_R(x) = P\{\omega: R(\omega) \le x\} = P(-\infty, x] = \int_{-\infty}^{x} f(t)\,dt$$

so that $F_R$ is represented as an integral.


DEFINITION. The random variable $R$ is said to be absolutely continuous iff
there is a nonnegative function $f = f_R$ defined on $E^1$ such that

$$F_R(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{for all real } x \qquad (2.3.1)$$

$f_R$ is called the density function of $R$. We shall see in Section 2.5 that
$F_R(x)$ must approach 1 as $x \to \infty$; hence $\int_{-\infty}^{\infty} f(x)\,dx = 1$.


† “Integrable” will from now on mean “Riemann integrable.”

FIGURE 2.3.2 Distribution and Density Functions of a Uniformly Distributed Random
Variable.


▶ Example 2. A number $R$ is chosen at random between $a$ and $b$; $R$ is
assumed to be uniformly distributed; that is, the probability that $R$ will fall
into an interval of length $c$ depends only on $c$, not on the position of the
interval within $[a, b]$.

We take $\Omega = E^1$, $\mathcal{F}$ = Borel sets, $R(\omega) = \omega$, $f(x) = f_R(x) = 1/(b - a)$,
$a \le x \le b$; $f(x) = 0$, $x > b$ or $x < a$. Define $P(B) = \int_B f(x)\,dx$. In particular,
if $B$ is a subinterval of $[a, b]$, then $P(B)$ = (length of $B$)$/(b - a)$. The
density and distribution function of $R$ are shown in Figure 2.3.2. ◀


Note. The values of $F_R$ are probabilities, but the values of $f_R$ are not;
probabilities are found by integrating $f_R$:

$$F_R(x) = P\{R \le x\} = \int_{-\infty}^{x} f_R(t)\,dt$$

If $R$ is absolutely continuous, then

$$P\{a < R \le b\} = \int_a^b f_R(x)\,dx, \quad a < b$$

For $\{R \le b\}$ is the disjoint union of the events $\{R \le a\}$ and
$\{a < R \le b\}$; hence $P\{R \le b\} = P\{R \le a\} + P\{a < R \le b\}$. It follows
that

$$P\{a < R \le b\} = F_R(b) - F_R(a) \qquad (2.3.2)$$

$$= \int_{-\infty}^{b} f_R(x)\,dx - \int_{-\infty}^{a} f_R(x)\,dx = \int_a^b f_R(x)\,dx$$

Thus, if $Q(B) = P\{R \in B\}$, we have $Q(B) = \int_B f_R(x)\,dx$ when $B$ is
an interval $(a, b]$. By Theorem 1, $Q(B) = \int_B f_R(x)\,dx$ for all Borel
sets $B$. Therefore, if $R$ is absolutely continuous,

$$P\{R \in B\} = \int_B f_R(x)\,dx \quad \text{for all Borel sets } B$$

The basic point is that the density function $f_R$ determines the probability
of all events involving $R$.


If $R$ is absolutely continuous, then

$$P\{R = c\} = P\{c \le R \le c\} = \int_c^c f_R(x)\,dx = 0$$

The event $\{R = c\}$ is in general not impossible; for example, if $R$ is uniformly
distributed between $a$ and $b$, each event $\{R = x\}$, $a \le x \le b$, is possible;
that is, the set $\{\omega: R(\omega) = x\}$ is not empty. But the event $\{R = c\}$ has
probability 0. This does not contradict the axioms of probability. The
definition of a probability measure requires that if the event $A$ is impossible
(i.e., $A = \emptyset$) then $P(A) = 0$; the converse need not be true. Intuitively,
if $R$ is uniformly distributed between $a$ and $b$, it should be expected
that all events $\{R = x\}$, $a \le x \le b$, will have the same probability. Any
probability other than 0 will lead to a contradiction, since there are
infinitely many points $x$ between $a$ and $b$.

As a consequence of the fact that $P\{R = x\} = 0$ in the absolutely
continuous case, we have

$$P\{a < R < b\} = P\{a \le R < b\} = P\{a < R \le b\} = P\{a \le R \le b\}
= \int_a^b f_R(x)\,dx = F_R(b) - F_R(a) \qquad (2.3.3)$$


Notice also that although in the discrete case the probability function of
$R$ determines the probability of all events involving $R$, in the absolutely
continuous case it gives no information at all, since $p_R(x) = P\{R = x\} = 0$ for
all $x$. However, the distribution function of $R$ is still adequate, since $F_R$
determines $f_R$. If $f_R$ is continuous, it may be obtained from $F_R$ by
differentiation; that is,

$$\frac{d}{dx} \int_{-\infty}^{x} f_R(t)\,dt = f_R(x)$$

(the fundamental theorem of calculus). The general proof that $F_R$ determines
$f_R$ is measure-theoretic, and we shall not pursue it here.

If $f_R$ is continuous, we have just seen that $F_R$ is differentiable, and its
derivative is $f_R$. In general, if $R$ is absolutely continuous, $F_R(x)$ will be a
continuous function of $x$, but again we shall not pursue this.

We shall show in Section 2.5 that the distribution function of an arbitrary
random variable must be nondecreasing [$a < b$ implies $F_R(a) \le F_R(b)$],
must approach 1 as $x \to \infty$, and must approach 0 as $x \to -\infty$.


▶ Example 3. Let $R$ be the time of emission of the first electron from the
cathode of a vacuum tube. Under certain physical assumptions, it turns out
that $R$ has the following density function:

$$f_R(x) = \lambda e^{-\lambda x}, \quad x \ge 0 \quad (\lambda \text{ constant})$$
$$f_R(x) = 0, \quad x < 0$$

(see Figure 2.3.3). We have

$$F_R(x) = \int_{-\infty}^{x} f_R(t)\,dt$$

so $F_R(x) = 0$, $x < 0$, and if $x \ge 0$,

$$F_R(x) = \int_0^x f_R(t)\,dt = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}$$

FIGURE 2.3.3 Exponential Density and Distribution Functions.

FIGURE 2.3.4 Calculation of Probabilities. [Plot of $(R - 1)(R - 2)$ against $R$.]


We calculate some probabilities of events involving $R$:

$$P\{1 \le R \le 2\} = \int_1^2 \lambda e^{-\lambda x}\,dx = e^{-\lambda} - e^{-2\lambda} = F_R(2) - F_R(1)$$

$$P\{(R - 1)(R - 2) > 0\} = P\{R < 1 \text{ or } R > 2\} = P\{R < 1\} + P\{R > 2\}$$

$$= \int_0^1 \lambda e^{-\lambda x}\,dx + \int_2^{\infty} \lambda e^{-\lambda x}\,dx = 1 - e^{-\lambda} + e^{-2\lambda}$$

(see Figure 2.3.4). ◀

REMARK. You will often see the statement “Let $R$ be an absolutely
continuous random variable with density function $f$,” with no reference
made to the underlying probability space. However, we have seen
that we can always supply an appropriate space, as follows. Take
$\Omega = E^1$, $\mathcal{F}$ = Borel sets, $P(B) = \int_B f(x)\,dx$ for all $B \in \mathcal{F}$. If
$R(\omega) = \omega$, $\omega \in \Omega$, then $R$ is absolutely continuous and has density $f$.

In a sense, it does not make any difference how we arrive at $\Omega$
and $P$; we may equally well use a different $\Omega$ and $P$ and a different $R$,
as long as $R$ is absolutely continuous with density $f$. No matter what
construction we use, we get the same essential result, namely,

$$P\{R \in B\} = \int_B f(x)\,dx$$

Thus questions about probabilities of events involving $R$ are answered
completely by knowledge of the density $f$.
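
The two probabilities computed in Example 3 can be checked numerically. The Python sketch below is an illustration only: the rate `lam = 1.5` is an assumed value (the text leaves the constant unspecified), and `integrate` is a hypothetical midpoint-rule helper, not something the text defines. It compares the closed forms with a direct numerical integral of the density.

```python
from math import exp


def exp_density(x, lam):
    """Exponential density: lam * exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0


def exp_cdf(x, lam):
    """Distribution function 1 - exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0


def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


lam = 1.5  # assumed value, chosen only for illustration

# P{1 <= R <= 2}, from the distribution function and by integrating the density
p_between = exp_cdf(2, lam) - exp_cdf(1, lam)
p_numeric = integrate(lambda x: exp_density(x, lam), 1, 2)

# P{(R - 1)(R - 2) > 0} = P{R < 1} + P{R > 2}
p_outside = 1 - exp(-lam) + exp(-2 * lam)

print(p_between, p_numeric, p_outside)
```

Since the events {1 <= R <= 2} and {(R - 1)(R - 2) > 0} are complementary, the two probabilities should add up to 1 for any choice of the rate.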
PROBLEMS


1. An absolutely continuous random variable $R$ has a density function $f(x) =
(1/2)e^{-|x|}$.
(a) Sketch the distribution function of $R$.
(b) Find the probability of each of the following events.

(1) $\{|R| \le 2\}$
(2) $\{|R| \le 2 \text{ or } R \ge 0\}$
(3) $\{|R| \le 2 \text{ and } R \le -1\}$
(4) $\{|R| + |R - 3| \le 3\}$
(5) $\{R^3 - R^2 - R - 2 \le 0\}$
(6) $\{e^{\sin \pi R} \ge 1\}$
(7) $\{R \text{ is irrational}\}$ (= $\{\omega: R(\omega)$ is an irrational number$\}$)


2. Consider a sequence of five Bernoulli trials. Let $R$ be the number of times that a
head is followed immediately by a tail. For example, if $\omega$ = HHTHT then
$R(\omega) = 2$, since a head is followed directly by a tail at trials 2 and 3, and also
at trials 4 and 5. Find the probability function of $R$.


2.4 FUNCTIONS OF A RANDOM VARIABLE 


A general problem that arises in many branches of science is the following.
A system of some sort is given, to which an input is applied; knowledge of
some of the characteristics of the system, together with knowledge of the
input, will allow some estimate of the behavior at the output. We formulate a
special case of this problem. Given a random variable $R_1$ on a probability
space, and a real-valued function $g$ on the reals, we define a random variable
$R_2$ by $R_2 = g(R_1)$; that is, $R_2(\omega) = g(R_1(\omega))$, $\omega \in \Omega$. $R_1$ plays the role of the
input, and $g$ the role of the system; the output $R_2$ is a random variable
defined on the same space as $R_1$. Given the function $g$ and the distribution or
density function of the random variable $R_1$, the problem is to find the
distribution or density function of $R_2$.


Note. If $R_1$ is a random variable and we set $R_2 = g(R_1)$, the question arises
as to whether $R_2$ is in fact a random variable. The answer is yes if
$g$ is continuous or piecewise continuous; we shall consider this
problem in greater detail in Section 2.7.


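▶ Example 1. Let $R_1$ be absolutely continuous, with the density $f_1$ given in
Figure 2.4.1. Let $R_2 = R_1^2$; that is, $R_2(\omega) = R_1^2(\omega)$, $\omega \in \Omega$. Find the
distribution or density function of $R_2$.

We shall indicate two approaches to the problem.

FIGURE 2.4.1 [The density $f_1$. As the calculations in this example indicate,
$f_1(x) = 1/2$ for $-1 \le x \le 0$, $f_1(x) = (1/2)e^{-x}$ for $x > 0$, and $f_1(x) = 0$
for $x < -1$.]

FIGURE 2.4.2 [The curve $R_2 = R_1^2$, with the points $-\sqrt{y}$, 0, $\sqrt{y}$ marked on
the $R_1$ axis.]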


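DISTRIBUTION FUNCTION METHOD. In this method the distribution
function $F_2$ of $R_2$ is found directly, by expressing the event $\{R_2 \le y\}$ in terms of
the random variable $R_1$. First, since $R_2 \ge 0$, we have $F_2(y) = P\{R_2 \le y\} = 0$
for $y < 0$.

If $y \ge 0$, then $R_2 \le y$ iff $-\sqrt{y} \le R_1 \le \sqrt{y}$ (see Figure 2.4.2). Thus, if
$y \ge 0$,

$$F_2(y) = P\{R_2 \le y\} = P\{-\sqrt{y} \le R_1 \le \sqrt{y}\} = \int_{-\sqrt{y}}^{\sqrt{y}} f_1(x)\,dx$$

In particular, if $0 \le y \le 1$, then

$$F_2(y) = \int_{-\sqrt{y}}^{\sqrt{y}} f_1(x)\,dx = \int_{-\sqrt{y}}^{0} \tfrac{1}{2}\,dx + \int_0^{\sqrt{y}} \tfrac{1}{2} e^{-x}\,dx
= \tfrac{1}{2}\sqrt{y} + \tfrac{1}{2}\left(1 - e^{-\sqrt{y}}\right)$$

(see Figure 2.4.3).
If $y > 1$,

$$F_2(y) = \int_{-\sqrt{y}}^{\sqrt{y}} f_1(x)\,dx = \int_{-\sqrt{y}}^{-1} 0\,dx + \int_{-1}^{0} \tfrac{1}{2}\,dx + \int_0^{\sqrt{y}} \tfrac{1}{2} e^{-x}\,dx
= \tfrac{1}{2} + \tfrac{1}{2}\left(1 - e^{-\sqrt{y}}\right)$$

(see Figure 2.4.4). A sketch of $F_2$ is given in Figure 2.4.5.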


FIGURE 2.4.3

FIGURE 2.4.4 [Points $-\sqrt{y}$, $-1$, 0, 1, $\sqrt{y}$ marked on the $x$ axis.]


We would like to conclude, by inspection of the distribution function $F_2$,
that the random variable $R_2$ is absolutely continuous. We should be able to
find the density $f_2$ of $R_2$ by differentiating $F_2$:

$$f_2(y) = \frac{dF_2(y)}{dy} = \frac{1}{4\sqrt{y}}\left(1 + e^{-\sqrt{y}}\right), \quad 0 < y < 1$$
$$\phantom{f_2(y) = \frac{dF_2(y)}{dy}} = \frac{1}{4\sqrt{y}}\, e^{-\sqrt{y}}, \quad y > 1$$
$$\phantom{f_2(y) = \frac{dF_2(y)}{dy}} = 0, \quad y < 0$$

(see Figure 2.4.6).

It may be verified that $F_2(y)$ is given by $\int_{-\infty}^{y} f_2(t)\,dt$, so that $f_2$ is in fact the
density of $R_2$. Thus, in this case, if we differentiate $F_2$ and then integrate the
derivative, we get back to $F_2$.

It is reasonable to expect that a random variable $R$, whose distribution
function is continuous everywhere and defined by an explicit formula or
collection of formulas, will be absolutely continuous. The following result
will cover almost all situations encountered in practice.

FIGURE 2.4.5 [Sketch of $F_2(y)$, with the branch $\tfrac{1}{2}\sqrt{y} + \tfrac{1}{2}(1 - e^{-\sqrt{y}})$ labeled.]

FIGURE 2.4.6 [Sketch of $f_2(y)$.]


Let $R$ be a random variable with distribution function $F$. Suppose that

(1) $F$ is continuous for all $x$
(2) $F$ is differentiable everywhere except possibly at a finite
    number of points                                              (2.4.1)
(3) The derivative $f(x) = F'(x)$ is continuous except possibly at
    a finite number of points

Then $R$ is an absolutely continuous random variable with density function $f$.
(The proof involves the application of the fundamental theorem of calculus;
see Problem 6.)


Note. Density functions need not be continuous or bounded. Also, in
this case there is an ambiguity in the values of $f_2(y)$ at $y = 0$ and
$y = 1$, since $F_2$ is not differentiable at these points. However, any
values may be assumed, since changing a function at a single point,
or a finite or countably infinite number of points, or in fact on a set
of total length (Lebesgue measure) zero, does not change the integral.


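DENSITY FUNCTION METHOD. In this approach we develop an explicit
formula for the density of $R_2$ in terms of that of $R_1$.

We first give an informal description. The probability that $R_2$ will lie in
the small interval $[y, y + dy]$ is

$$\int_y^{y+dy} f_2(t)\,dt$$

which is roughly $f_2(y)\,dy$ if $f_2$ is well-behaved near $y$. But if we set $h_1(y) =
\sqrt{y}$, $h_2(y) = -\sqrt{y}$, $y > 0$, then (see Figure 2.4.7)

$$P\{y \le R_2 \le y + dy\} = P\{h_1(y) \le R_1 \le h_1(y + dy)\} + P\{h_2(y + dy) \le R_1 \le h_2(y)\}$$

FIGURE 2.4.7 [The curve $R_2 = R_1^2$, with $h_2(y + dy)$, $h_2(y) = -\sqrt{y}$,
$h_1(y) = \sqrt{y}$, and $h_1(y + dy)$ marked on the $R_1$ axis.]

Hence

$$f_2(y)\,dy = f_1(h_1(y)) \left[\frac{h_1(y + dy) - h_1(y)}{dy}\right] dy
+ f_1(h_2(y)) \left[\frac{h_2(y) - h_2(y + dy)}{dy}\right] dy$$

Let $dy \to 0$ to obtain

$$f_2(y) = f_1(h_1(y))\, h_1'(y) + f_1(h_2(y)) \left(-h_2'(y)\right)
= f_1(h_1(y))\, |h_1'(y)| + f_1(h_2(y))\, |h_2'(y)|$$

In this case

$$f_2(y) = f_1(\sqrt{y})\, \frac{1}{2\sqrt{y}} + f_1(-\sqrt{y})\, \frac{1}{2\sqrt{y}}
= \frac{1}{2\sqrt{y}} \left[f_1(\sqrt{y}) + f_1(-\sqrt{y})\right]$$

Now (see Figure 2.4.1), if $0 < y < 1$,

$$f_2(y) = \left[\tfrac{1}{2} e^{-\sqrt{y}} + \tfrac{1}{2}\right] \frac{1}{2\sqrt{y}} = \frac{1}{4\sqrt{y}} \left(1 + e^{-\sqrt{y}}\right)$$

If $y > 1$,

$$f_2(y) = \left[\tfrac{1}{2} e^{-\sqrt{y}} + 0\right] \frac{1}{2\sqrt{y}} = \frac{1}{4\sqrt{y}}\, e^{-\sqrt{y}}$$

as before.
Similar reasoning shows that in general

$$f_2(y) = f_1(h_1(y))\, |h_1'(y)| + \cdots + f_1(h_n(y))\, |h_n'(y)|$$

where $h_1(y), \ldots, h_n(y)$ are the values of $R_1$ corresponding to $R_2 = y$.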

Here is the formal statement. Suppose that the domain of $g$ can be written
as the union of intervals $I_1, I_2, \ldots, I_n$. Assume that over the interval $I_j$,
$g$ is strictly increasing or strictly decreasing and is differentiable (except
possibly at the end points), with $h_j$ = inverse of $g$ over $I_j$. Let $F_1$ satisfy the
three conditions (2.4.1). Then

$$f_2(y) = \sum_{j=1}^{n} f_1(h_j(y))\, |h_j'(y)|$$

where $f_1(h_j(y))\, |h_j'(y)|$ is interpreted as 0 if $y \notin$ the domain of $h_j$. For the
proof, see Problem 7.


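REMARK. If we have $R_1 = h(R_2)$, where $h$ is a one-to-one differentiable
function, and $R_2 = g(R_1)$, where $g$ is the inverse of $h$, then

$$h'(y) = \frac{1}{g'(x)}\bigg|_{x = h(y)}$$

Thus we may write

$$f_2(y) = \sum_{j=1}^{n} \frac{f_1(h_j(y))}{|g'(x)|_{x = h_j(y)}}$$

where $R_2 = g(R_1)$.


In the present example we have

$$g(x) = x^2, \quad h_1(y) = \sqrt{y}, \quad h_2(y) = -\sqrt{y}$$

so that

$$f_2(y) = \frac{f_1(\sqrt{y})}{|2x|_{x = \sqrt{y}}} + \frac{f_1(-\sqrt{y})}{|2x|_{x = -\sqrt{y}}}
= \frac{1}{2\sqrt{y}} \left[f_1(\sqrt{y}) + f_1(-\sqrt{y})\right]$$

as before. ◀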
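
The agreement between the two methods in Example 1 can be checked numerically. The Python sketch below is an illustration under stated assumptions: the density `f1` is the one implied by the integrals in the example (1/2 on [-1, 0], (1/2)e^{-x} for x > 0), and `numeric_derivative` is a hypothetical helper introduced only for this check. It compares the density-method formula with a finite-difference derivative of the distribution function.

```python
from math import exp, sqrt


def f1(x):
    """Density of R1 as implied by the example's integrals (assumed reading of Figure 2.4.1)."""
    if -1 <= x <= 0:
        return 0.5
    if x > 0:
        return 0.5 * exp(-x)
    return 0.0


def F2(y):
    """Distribution function of R2 = R1**2 (distribution function method)."""
    if y < 0:
        return 0.0
    r = sqrt(y)
    if y <= 1:
        return 0.5 * r + 0.5 * (1 - exp(-r))
    return 0.5 + 0.5 * (1 - exp(-r))


def f2(y):
    """Density of R2 (density function method): sum of f1(h_j(y)) * |h_j'(y)|."""
    if y <= 0:
        return 0.0
    r = sqrt(y)                    # h_1(y) = sqrt(y), h_2(y) = -sqrt(y)
    return (f1(r) + f1(-r)) / (2 * r)


def numeric_derivative(F, y, h=1e-6):
    """Central-difference approximation to F'(y)."""
    return (F(y + h) - F(y - h)) / (2 * h)


# Sample points avoid y = 0 and y = 1, where F2 is not differentiable.
for y in (0.25, 0.81, 2.0, 9.0):
    print(y, f2(y), numeric_derivative(F2, y))
```

The two printed columns should agree to several decimal places at every sample point.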


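▶ Example 2. Let $R_1$ be uniformly distributed between 0 and $2\pi$; that is,

$$f_1(x) = \frac{1}{2\pi}, \quad 0 \le x \le 2\pi$$
$$f_1(x) = 0 \quad \text{elsewhere}$$

Let $R_2 = \sin R_1$ (see Figure 2.4.8).

FIGURE 2.4.8 [The curve $R_2 = \sin R_1$, with the points $\pi - \sin^{-1} y$ and
$2\pi + \sin^{-1} y$ marked.]

DISTRIBUTION FUNCTION METHOD. If $0 \le y \le 1$,

$$F_2(y) = P\{R_2 \le y\}
= P\{0 \le R_1 \le \sin^{-1} y\} + P\{\pi - \sin^{-1} y \le R_1 \le 2\pi\}$$

$$= \int_0^{\sin^{-1} y} \frac{1}{2\pi}\,dx + \int_{\pi - \sin^{-1} y}^{2\pi} \frac{1}{2\pi}\,dx
= \frac{1}{2} + \frac{1}{\pi} \sin^{-1} y$$

where the branch of the arc sin function is chosen so that $-\pi/2 \le \sin^{-1} y \le
\pi/2$. If $-1 \le y < 0$,

$$F_2(y) = P\{\pi - \sin^{-1} y \le R_1 \le 2\pi + \sin^{-1} y\}
= \frac{1}{2} + \frac{1}{\pi} \sin^{-1} y$$

as above, and

$$f_2(y) = F_2'(y) = \frac{1}{\pi \sqrt{1 - y^2}}, \quad -1 < y < 1$$

DENSITY FUNCTION METHOD. If $0 < y < 1$,

$$f_2(y) = f_1(\sin^{-1} y) \left|\frac{d}{dy} \sin^{-1} y\right| + f_1(\pi - \sin^{-1} y) \left|\frac{d}{dy} (\pi - \sin^{-1} y)\right|
= \frac{1}{\pi \sqrt{1 - y^2}}$$

Similarly,

$$f_2(y) = \frac{1}{\pi \sqrt{1 - y^2}} \quad \text{for } -1 < y < 0$$

[$f_2(y) = 0$, $|y| > 1$] (see Figure 2.4.9). ◀

FIGURE 2.4.9 [Sketch of $f_2(y)$ for $-1 < y < 1$.]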
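
As a rough check on Example 2, the distribution function can be compared with a direct count. The Python sketch below is illustrative only; `grid_fraction` is a hypothetical helper that replaces random sampling by equally spaced points in (0, 2π), which play the role of a uniformly distributed R_1, and measures the fraction of them with sin x ≤ y.

```python
from math import asin, pi, sin


def F2(y):
    """Distribution function of R2 = sin(R1), R1 uniform on [0, 2*pi]."""
    if y < -1:
        return 0.0
    if y >= 1:
        return 1.0
    return 0.5 + asin(y) / pi      # asin uses the branch in [-pi/2, pi/2]


def grid_fraction(y, n=200_000):
    """Fraction of n equally spaced midpoints x in (0, 2*pi) with sin(x) <= y."""
    hits = sum(1 for i in range(n) if sin(2 * pi * (i + 0.5) / n) <= y)
    return hits / n


for y in (-0.9, -0.5, 0.0, 0.3, 0.8):
    print(y, F2(y), grid_fraction(y))
```

A deterministic grid is used instead of a random number generator so that the comparison is reproducible; the two columns should differ only by an amount of order 1/n.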
PROBLEMS

1. Let $R_1$ be absolutely continuous with density

$$f_1(x) = e^{-x}, \quad x \ge 0; \qquad f_1(x) = 0, \quad x < 0$$

Define

$$R_2 = R_1 \quad \text{if } R_1 \le 1$$
$$R_2 = \frac{1}{R_1} \quad \text{if } R_1 > 1$$

Show that $R_2$ is absolutely continuous and find its density.


2. An absolutely continuous random variable $R_1$ is uniformly distributed between
$-1$ and $+1$. Find and sketch either the density or the distribution function of
the random variable $R_2$, where $R_2 = e^{-R_1}$.


3. Let $R_1$ have density $f_1(x) = 1/x^2$, $x \ge 1$; $f_1(x) = 0$, $x < 1$. Define

$$R_2 = 2R_1 \quad \text{for } R_1 \le 2$$
$$R_2 = R_1^2 \quad \text{for } R_1 > 2$$

Find the density of $R_2$.

4. Let $R_1$ be as in Problem 3, and define

$$R_2 = 2R_1 \quad \text{for } R_1 \le 2$$
$$R_2 = 5 \quad \text{for } R_1 > 2$$

Find and sketch the distribution function of $R_2$; is $R_2$ absolutely continuous?
5. (a) Let $R_1$ have distribution function

$$F_1(x) = 1 - e^{-x}, \quad x \ge 0$$
$$F_1(x) = 0, \quad x < 0$$

Define

$$R_2 = 1 - e^{-R_1}, \quad R_1 \ge 0$$
$$R_2 = 0, \quad R_1 < 0$$

Show that $R_2$ is uniformly distributed between 0 and 1.

(b) In general, if a random variable $R_1$ has a continuous distribution function
$g(x) = F_1(x)$ and we define a random variable $R_2$ by $R_2 = g(R_1)$, show that
$R_2$ is uniformly distributed between 0 and 1.


6. If $R$ is a random variable with distribution function $F$, where $F$ is continuous
everywhere and has a continuous derivative $f$ at all but a finite number of points,
show that $R$ is absolutely continuous with density $f$.

7. Establish the validity of the formula

$$f_2(y) = \sum_{j=1}^{n} f_1(h_j(y))\, |h_j'(y)|$$

under the conditions given in the text.

8. Let $R_1$ be chosen at random between 0 and 1, with density $f_1$ [so that
$\int_0^1 f_1(y)\,dy = 1$]. Let $R_2$ be the second digit in the decimal expansion of $R_1$.
(To avoid ambiguity, write, for example, .3 as .3000···, not .2999···.)

(a) Show that $R_2 = k$ iff $i + 10^{-1}k \le 10R_1 < i + 10^{-1}(k + 1)$ for some $i =
0, 1, \ldots, 9$. Hence

$$P\{R_2 = k\} = \sum_{i=0}^{9} \int_{10^{-1}i + 10^{-2}k}^{10^{-1}i + 10^{-2}k + 10^{-2}} f_1(x)\,dx, \quad k = 0, 1, \ldots, 9$$

(b) If $R$ is uniformly distributed between 0 and 1, and $R_1 = \sqrt{R}$, find the
probability function of $R_2$ = the second digit in the decimal expansion of
$R_1$.

9. A projectile is fired with initial velocity $v_0$ at an angle $\theta$ uniformly distributed
between 0 and $\pi/2$ (see Figure P.2.4.9). If $R$ is the distance from the launch site
to the point at which the projectile returns to earth, find the density of $R$ (consider
only the effect of gravity).

FIGURE P.2.4.9


2.5 PROPERTIES OF DISTRIBUTION FUNCTIONS 


We shall establish some general properties of the distribution function of an
arbitrary random variable. We need two facts about probability measures.


Theorem 1. Let $(\Omega, \mathcal{F}, P)$ be a probability space.
(a) If $A_1, A_2, \ldots$ is an expanding sequence of sets in $\mathcal{F}$, that is, $A_n \subset A_{n+1}$
for all $n$, and $A = \bigcup_{n=1}^{\infty} A_n$, then $P(A) = \lim_{n \to \infty} P(A_n)$.

FIGURE 2.5.1 Expanding Sequence.

(b) If $A_1, A_2, \ldots$ is a contracting sequence of sets in $\mathcal{F}$, that is, $A_{n+1} \subset A_n$
for all $n$, and $A = \bigcap_{n=1}^{\infty} A_n$, then $P(A) = \lim_{n \to \infty} P(A_n)$.


PROOF.
(a) We can write

$$A = A_1 \cup (A_2 - A_1) \cup (A_3 - A_2) \cup \cdots \cup (A_n - A_{n-1}) \cup \cdots$$

(see Figure 2.5.1; note this is the expansion (1.3.11) in the special case of
an expanding sequence). Since this is a disjoint union,

$$P(A) = P(A_1) + P(A_2 - A_1) + P(A_3 - A_2) + \cdots$$
$$= P(A_1) + P(A_2) - P(A_1) + P(A_3) - P(A_2) + \cdots \quad \text{since } A_n \subset A_{n+1}$$
$$= \lim_{n \to \infty} P(A_n)$$

(b) If $A = \bigcap_{n=1}^{\infty} A_n$, then, by the DeMorgan laws, $A^c = \bigcup_{n=1}^{\infty} A_n^c$.
Now $A_{n+1} \subset A_n$; hence $A_n^c \subset A_{n+1}^c$. Thus the sets $A_n^c$ form an expanding
sequence, so, by (a), $P(A_n^c) \to P(A^c)$; that is, $1 - P(A_n) \to 1 - P(A)$. The
result follows.


Theorem 2. Let $F$ be the distribution function of an arbitrary random
variable $R$. Then


1. $F(x)$ is nondecreasing; that is, $a < b$ implies $F(a) \le F(b)$
For we have shown [see (2.3.2)] that $F(b) - F(a) = P\{a < R \le b\} \ge 0$.


2. $\lim_{x \to \infty} F(x) = 1$
Let $x_n$, $n = 1, 2, \ldots$ be a sequence of real numbers increasing to $+\infty$. Let
$A_n = \{R \le x_n\}$. Then the $A_n$ form an expanding sequence. (Since $x_n \le
x_{n+1}$, $R \le x_n$ implies $R \le x_{n+1}$.) Now $\bigcup_{n=1}^{\infty} A_n = \Omega$, since, given any point
$\omega \in \Omega$, $R(\omega)$ is a real number; hence, for sufficiently large $n$, $R(\omega) \le x_n$,
so that $\omega \in A_n$. Thus $P(A_n) \to P(\Omega) = 1$, that is, $\lim_{n \to \infty} F(x_n) = 1$.


3. $\lim_{x \to -\infty} F(x) = 0$
Let $x_n$, $n = 1, 2, \ldots$ be a sequence of real numbers decreasing to $-\infty$.
Let $A_n = \{R \le x_n\}$. Then the $A_n$ form a contracting sequence. (Since
$x_{n+1} \le x_n$, $R \le x_{n+1}$ implies $R \le x_n$.) Now $\bigcap_{n=1}^{\infty} A_n = \emptyset$, since if $\omega$ is
any point of $\Omega$, $R(\omega)$ cannot always be $\le x_n$ because $x_n \to -\infty$. Thus
$P(A_n) \to P(\emptyset) = 0$; that is, $F(x_n) \to 0$.


4. $F$ is continuous from the right; that is, $\lim_{x \to x_0^+} F(x) = F(x_0)$

Hence $F$ assumes the upper value at any discontinuity; see Figure 2.5.2.

FIGURE 2.5.2 Right Continuity of Distribution Functions.

Let $x_n$ approach $x_0$ from above; that is, let $x_n$, $n = 1, 2, \ldots$ be a (strictly)
decreasing sequence whose limit is $x_0$. As before, let $A_n = \{R \le x_n\}$. The
$A_n$ form a contracting sequence whose limit (intersection) is $A = \{R \le x_0\}$.
In order to show that $\bigcap_{n=1}^{\infty} A_n = \{R \le x_0\}$, we reason as follows. If
$R(\omega) \le x_n$ for all $n$, then, since $x_n \to x_0$, $R(\omega) \le x_0$. Conversely, if $R(\omega) \le
x_0$, then, since $x_0 < x_n$ for all $n$, $R(\omega) \le x_n$ for all $n$. Thus $P(A_n) \to P(A)$;
that is, $F(x_n) \to F(x_0)$.

5. $\lim_{x \to x_0^-} F(x) = P\{R < x_0\}$
[We write $F(x_0^-)$ for $\lim_{x \to x_0^-} F(x)$.]

Let $x_n$, $n = 1, 2, \ldots$ be a (strictly) increasing sequence whose limit is $x_0$.
Again let $A_n = \{R \le x_n\}$. The $A_n$ form an expanding sequence whose union
is $\{R < x_0\}$. To show $\bigcup_{n=1}^{\infty} A_n = \{R < x_0\}$, we reason as follows. If $\omega \in$
some $A_n$, then $R(\omega) \le x_n$, so that $R(\omega) < x_0$. Conversely, if $R(\omega) < x_0$,
then, since $x_n \to x_0$, eventually $R(\omega) < x_n$, so that $\omega \in \bigcup_{n=1}^{\infty} A_n$. Thus
$P(A_n) \to P\{R < x_0\}$, and the result follows.


6. $P\{R = x_0\} = F(x_0) - F(x_0^-)$
Thus $F$ is continuous at $x_0$ iff $P\{R = x_0\} = 0$, and if $F$ is discontinuous at $x_0$,
the magnitude of the jump is the probability that $R = x_0$.
For $P\{R \le x_0\} = P\{R < x_0\} + P\{R = x_0\}$, so that

$$F(x_0) = F(x_0^-) + P\{R = x_0\}$$
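
Properties 4 and 6 can be seen concretely for a discrete random variable. The Python sketch below is an illustration only: it reuses the probabilities .16, .48, .36 at the points 0, 1, 2 from Example 1 of Section 2.3, and the helper `make_distribution_function` and the small offset `eps` (standing in for the one-sided limits) are assumptions of the sketch, not part of the text.

```python
def make_distribution_function(pmf):
    """Return F with F(x) = P{R <= x} for a discrete R with probability function pmf."""
    points = sorted(pmf)

    def F(x):
        return sum(pmf[v] for v in points if v <= x)

    return F


pmf = {0: 0.16, 1: 0.48, 2: 0.36}   # two tosses of a coin with P(heads) = .6
F = make_distribution_function(pmf)
eps = 1e-9                           # small offset used in place of a one-sided limit

for x0 in (0, 1, 1.5, 2):
    right_limit = F(x0 + eps)        # equals F(x0): continuity from the right
    left_limit = F(x0 - eps)         # approximates F(x0-) = P{R < x0}
    jump = F(x0) - left_limit        # equals P{R = x0}
    print(x0, F(x0), right_limit, left_limit, jump)
```

At x0 = 1.5, which is not a possible value of R, the jump is 0 and F is continuous there; at 0, 1, and 2 the jump reproduces the probability function.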


REMARK. The random variable $R$ is said to be continuous iff its distribution
function $F_R(x)$ is a continuous function of $x$ for all $x$. In any reasonable
case a continuous random variable will have a density (that is, it
will be absolutely continuous), but it is possible to establish the
existence of random variables that are continuous but not absolutely
continuous.


7. Let $F$ be a function from the reals to the reals, satisfying properties
1, 2, 3, and 4 above. Then $F$ is the distribution function of some random
variable.

This is a somewhat vague statement. Let us try to clarify it, even though we
omit the proof. What we are doing essentially is making the statement “Let
$R$ be a random variable with distribution function $F$.” It is up to us to supply
the underlying probability space. As we have done before, we take $\Omega = E^1$,
$\mathcal{F}$ = Borel sets, $R(\omega) = \omega$. Now if $F$ is to be the distribution function of $R$,
we must have, for $a < b$,

$$P(a, b] = P\{a < R \le b\} = F(b) - F(a) \quad \text{by (2.3.2)}$$

It turns out that if $F$ satisfies conditions 1-4, there is a unique probability
measure $P$ defined on the Borel subsets of $E^1$ such that $P(a, b] = F(b) - F(a)$
for all real $a$, $b$, $a < b$; thus the probabilities of all events involving $R$ are
determined by $F$. If we let $a \to -\infty$, we obtain $P(-\infty, b] = F(b)$, that is,
$P\{R \le b\} = F(b)$, so that in fact $F$ is the distribution function of $R$. In the
special case in which $F(x) = \int_{-\infty}^{x} f(t)\,dt$, where $f$ is a nonnegative integrable
function and $\int_{-\infty}^{\infty} f(x)\,dx = 1$, $P(a, b] = F(b) - F(a) = \int_a^b f(x)\,dx$. This is
exactly the situation we considered in Theorem 1 of Section 2.3.


PROBLEMS 


1. Let $R$ be a random variable with the distribution function shown in Figure
P.2.5.1; notice that $R$ is neither discrete nor continuous. Find the probability
of the following events.

FIGURE P.2.5.1 [Sketch of $F_R(x)$.]

(a) $\{R = 2\}$

(b) $\{R < 2\}$

(c) $\{R = 2 \text{ or } .5 < R < 1.5\}$
(d) $\{R = 2 \text{ or } .5 < R < 3\}$


2. Let $R$ be an arbitrary random variable with distribution function $F$. We have
seen that $P\{a < R \le b\} = F(b) - F(a)$, $a < b$. Show that

$$P\{a \le R \le b\} = F(b) - F(a^-)$$
$$P\{a \le R < b\} = F(b^-) - F(a^-)$$
$$P\{a < R < b\} = F(b^-) - F(a)$$

(Of course these are all equal if $F$ is continuous at $a$ and $b$.)


2.6 JOINT DENSITY FUNCTIONS 


We are going to investigate situations in which we deal simultaneously with
several random variables defined on the same sample space. As an
introductory example, suppose that a person is selected at random from a certain
population, and his age and weight are recorded. We may take as the sample
space the set of all pairs $(x, y)$ of real numbers, that is, the Euclidean plane
$E^2$, where we interpret $x$ as the age and $y$ as the weight. Let $R_1$ be the age of
the person selected, and $R_2$ the weight; that is, $R_1(x, y) = x$, $R_2(x, y) = y$.
We wish to assign probabilities to events that involve $R_1$ and $R_2$
simultaneously. A cross-section of the available data might appear as shown in
Figure 2.6.1. Thus there are 4 million people whose age is between 20 and 25
and (simultaneously) whose weight is between 150 and 160 pounds, and so
on. Now suppose that we wish to estimate the number of people between 22
and 23 years, and 154 and 156 pounds. There are 4 million people spread over
5 years and 10 pounds, or 4/50 million per year-pound. We are interested in
a range of 1 year and 2 pounds, and so our estimate is 4/50 × 1 × 2 = 8/50
million (see Figure 2.6.2). If the total population is 200 million, then

$$P\{22 \le R_1 \le 23,\ 154 \le R_2 \le 156\}$$

should be approximately

$$\frac{8/50}{200} = .0008$$

FIGURE 2.6.1 Age-Weight Data (Number of People Is in Millions). [Weight axis
marked at 150 and 160; age axis marked at 20 and 25.]

FIGURE 2.6.2 Estimation of Probabilities. [Weight axis marked at 154 and 156; age
axis marked at 22 and 23.]


NOTATION. $\{22 \le R_1 \le 23,\ 154 \le R_2 \le 156\}$ means $\{22 \le R_1 \le 23$ and
$154 \le R_2 \le 156\}$.


What we are doing is multiplying an age-weight density 4/50 by an area
1 × 2 to estimate the number of people or, equally well, a probability density
4/[50(200)] by an area (1 × 2) to estimate the probability.
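
The arithmetic of the age-weight estimate can be laid out step by step. The Python lines below are only a restatement of the numbers in the text; the variable names are chosen for this sketch.

```python
# Cell of the data: ages 20 to 25, weights 150 to 160 pounds, 4 million people.
people_in_cell = 4e6
cell_area = (25 - 20) * (160 - 150)          # 50 year-pounds
density = people_in_cell / cell_area         # people per year-pound

# Region of interest: ages 22 to 23, weights 154 to 156 pounds.
target_area = (23 - 22) * (156 - 154)        # 2 year-pounds
estimated_people = density * target_area     # density times area

total_population = 200e6
estimated_probability = estimated_people / total_population

print(estimated_people, estimated_probability)
```

The estimate treats the density as constant over the cell, which is exactly the approximation the text makes before passing to an integral.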

Thus it appears that we should assign probabilities by means of an integral
over an area. Let us try to construct an appropriate probability space. We
take $\Omega = E^2$, $\mathcal{F}$ = the Borel subsets of $E^2$. Suppose we have a nonnegative
real-valued function $f$ on $E^2$, with

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$$

Theorem 1 of Section 2.3 holds just as well in the two-dimensional case;
there is a unique probability measure $P$ on $\mathcal{F}$ such that $P(B) = \iint_B f(x, y)\,dx\,dy$
for all rectangles $B$.

If we define $R_1(x, y) = x$, $R_2(x, y) = y$, then

$$P\{(R_1, R_2) \in B\} = P(B) = \iint_B f(x, y)\,dx\,dy$$

For example,

$$P\{a < R_1 \le b,\ c < R_2 \le d\} = \int_{x=a}^{b} \int_{y=c}^{d} f(x, y)\,dy\,dx$$

The joint distribution function of two arbitrary random variables $R_1$ and
$R_2$ is defined by

$$F_{12}(x, y) = P\{R_1 \le x,\ R_2 \le y\}$$

In the present case we have

$$F_{12}(x, y) = \int_{u=-\infty}^{x} \int_{v=-\infty}^{y} f(u, v)\,dv\,du$$

In general, if $R_1$ and $R_2$ are arbitrary random variables defined on a given
probability space, the pair $(R_1, R_2)$ is said to be absolutely continuous iff
there is a nonnegative function $f = f_{12}$ defined on $E^2$ such that

$$F_{12}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{12}(u, v)\,dv\,du \quad \text{for all real } x, y \qquad (2.6.1)$$

$f_{12}$ is called the density of $(R_1, R_2)$ or the joint density of $R_1$ and $R_2$.
Just as in the one-dimensional case, if $(R_1, R_2)$ is absolutely continuous, it
follows that

$$P\{(R_1, R_2) \in B\} = \iint_B f_{12}(x, y)\,dx\,dy$$

for all two-dimensional Borel sets $B$ (see Problem 1). Again, as in the
one-dimensional case, if $f$ is a nonnegative function on $E^2$ with

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$$

we can always find random variables $R_1$, $R_2$ such that $(R_1, R_2)$ is absolutely
continuous with density $f$. We take $\Omega = E^2$, $\mathcal{F}$ = Borel sets, $R_1(x, y) = x$,
$R_2(x, y) = y$, $P(B) = \iint_B f(x, y)\,dx\,dy$. Even if we use a completely different
construction, we get the same result, namely,

$$P\{(R_1, R_2) \in B\} = \iint_B f(x, y)\,dx\,dy$$


We have a similar situation in $n$ dimensions. If the $n$ random variables
$R_1, R_2, \ldots, R_n$ are all defined on the same probability space, the joint
distribution function of $R_1, R_2, \ldots, R_n$ is defined by

$$F_{12 \cdots n}(x_1, \ldots, x_n) = P\{R_1 \le x_1, \ldots, R_n \le x_n\}$$

The random vector or $n$-tuple $(R_1, \ldots, R_n)$ is said to be absolutely
continuous iff there is a nonnegative function $f_{12 \cdots n}$ defined on $E^n$, called the
density of $(R_1, \ldots, R_n)$ or the joint density of $R_1, \ldots, R_n$, such that

$$F_{12 \cdots n}(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_{12 \cdots n}(u_1, \ldots, u_n)\,du_n \cdots du_1 \qquad (2.6.2)$$

for all real $x_1, \ldots, x_n$.

Notice that $f_{12 \cdots n}$ can be recovered from $F_{12 \cdots n}$ by differentiation:

$$\frac{\partial^n F_{12 \cdots n}(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n} = f_{12 \cdots n}(x_1, \ldots, x_n)$$

at least at points where $f_{12 \cdots n}$ is continuous.

If $(R_1, \ldots, R_n)$ is absolutely continuous, then

$$P\{(R_1, \ldots, R_n) \in B\} = \int \cdots \int_B f_{12 \cdots n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$

for all $n$-dimensional Borel sets $B$.
If $f$ is a nonnegative function on $E^n$ such that

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = 1$$

we can always find random variables $R_1, \ldots, R_n$ such that $(R_1, \ldots, R_n)$
is absolutely continuous with density $f$. We take $\Omega = E^n$, $\mathcal{F}$ = Borel sets,
and define $R_1(x_1, \ldots, x_n) = x_1, \ldots, R_n(x_1, \ldots, x_n) = x_n$. If $B$ is any Borel
subset of $E^n$, we assign

$$P(B) = \int \cdots \int_B f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$

Then $(R_1, \ldots, R_n)$ is absolutely continuous with density $f$.


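▶ Example 1. Let

$$f_{12}(x, y) = 1 \quad \text{if } 0 \le x \le 1 \text{ and } 0 \le y \le 1$$
$$f_{12}(x, y) = 0 \quad \text{elsewhere}$$

(This is the uniform density on the unit square.) We may as well take $\Omega = E^2$,
$\mathcal{F}$ = Borel sets, $R_1(x, y) = x$, $R_2(x, y) = y$,

$$P(B) = \iint_B f_{12}(x, y)\,dx\,dy$$

Let us calculate the probability that $1/2 \le R_1 + R_2 \le 3/2$. Now

$$\{\tfrac{1}{2} \le R_1 + R_2 \le \tfrac{3}{2}\} = \{(x, y): \tfrac{1}{2} \le x + y \le \tfrac{3}{2}\}$$

Thus (see Figure 2.6.3)

$$P\{\tfrac{1}{2} \le R_1 + R_2 \le \tfrac{3}{2}\} = \iint_{1/2 \le x + y \le 3/2} f_{12}(x, y)\,dx\,dy
= \iint_{\text{shaded area}} 1\,dx\,dy = \text{shaded area} = 1 - 2\left(\tfrac{1}{8}\right) = \tfrac{3}{4}$$

If we want the probability that $1/2 \le R_1 \le 3/4$ and $0 \le R_2 \le 1/2$, we obtain

$$P\{\tfrac{1}{2} \le R_1 \le \tfrac{3}{4},\ 0 \le R_2 \le \tfrac{1}{2}\} = \int_{x=1/2}^{3/4} \int_{y=0}^{1/2} 1\,dy\,dx = \tfrac{1}{4} \cdot \tfrac{1}{2} = \tfrac{1}{8}$$ ◀

FIGURE 2.6.3 Calculation of $P\{\tfrac{1}{2} \le R_1 + R_2 \le \tfrac{3}{2}\}$.

FIGURE 2.6.4 Calculation of $P\{R_1 \ge R_2 \ge 2\}$.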
e=1/2 Jy=0 


> Example 2. Let 
Aty=e"™, xy 20 


= 0 elsewhere 


Let us calculate the probability that R, > R, > 2. We have (see Figure 2.6.4) 


P{R, > R,>2}= | fral, y) dx dy 


x2->y=2 


=| e* de| e "dy =| e*(e" — ee) dx 
2 2 2 


= et 


—te*=te*< 
To summarize:

\[
P\{(R_1, R_2) \in B\} = \iint_B f_{12}(x, y)\,dx\,dy
\]

The probability of any event is found by integrating the density function
over the set defined by the event. This is perhaps about as close as one can come
to a one-sentence summary of the role of density functions in probability
theory.
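As a rough numerical check of Examples 1 and 2, the following Python sketch approximates the integral of a density over an event by a midpoint Riemann sum. The helper name `prob_of_event`, the grid sizes, and the truncation of the quadrant to a finite box in Example 2 are assumptions made for illustration; they are not part of the text.

```python
import math

def prob_of_event(density, event, x_range, y_range, steps=400):
    """Midpoint Riemann sum of density(x, y) over the grid points where event(x, y) holds."""
    (x0, x1), (y0, y1) = x_range, y_range
    dx = (x1 - x0) / steps
    dy = (y1 - y0) / steps
    total = 0.0
    for i in range(steps):
        x = x0 + (i + 0.5) * dx
        for j in range(steps):
            y = y0 + (j + 0.5) * dy
            if event(x, y):
                total += density(x, y) * dx * dy
    return total

# Example 1: uniform density on the unit square.
p1 = prob_of_event(lambda x, y: 1.0,
                   lambda x, y: 0.5 <= x + y <= 1.5,
                   (0.0, 1.0), (0.0, 1.0))

# Example 2: density e^{-(x+y)} on x, y >= 0, truncated to [0, 20]^2
# (an assumption; the neglected tail mass is negligible at this accuracy).
p2 = prob_of_event(lambda x, y: math.exp(-(x + y)),
                   lambda x, y: x >= y >= 2.0,
                   (0.0, 20.0), (0.0, 20.0), steps=1000)

print(p1, 3 / 4)                  # Riemann sum next to the exact value 3/4
print(p2, 0.5 * math.exp(-4))     # Riemann sum next to the exact value (1/2)e^{-4}
```

The two printed pairs should agree to about two or three decimal places; the remaining gap comes from grid cells cut by the boundary of the event.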


PROBLEMS 


1. Let $F_{12}$ be the joint distribution function of $R_1$ and $R_2$, where $(R_1, R_2)$ is absolutely
continuous with density $f_{12}$. Show that

\[
P\{a_1 < R_1 \le b_1,\ a_2 < R_2 \le b_2\} = \int_{x=a_1}^{b_1} \int_{y=a_2}^{b_2} f_{12}(x, y)\,dy\,dx
\]

The uniqueness part of Theorem 2.3.1 (generalized to two dimensions) shows that

\[
P\{(R_1, R_2) \in B\} = \iint_B f_{12}(x, y)\,dx\,dy
\]

for all two-dimensional Borel sets $B$.

HINT: If $F$ is the joint distribution function of the random variables $R_1$ and $R_2$,
show that

\[
P\{a_1 < R_1 \le b_1,\ a_2 < R_2 \le b_2\} = F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2)
\]


2. If $F$ is the joint distribution function of the random variables $R_1$, $R_2$, and $R_3$,
express

\[
P\{a_1 < R_1 \le b_1,\ a_2 < R_2 \le b_2,\ a_3 < R_3 \le b_3\}
\]

in terms of $F$. Can you see a general pattern that will extend this result to n
dimensions?


3. If

\[
F(x, y) = \begin{cases} 1 & \text{for } x + y \ge 0 \\ 0 & \text{for } x + y < 0 \end{cases}
\]

(see Figure P.2.6.3), show that $F$ cannot possibly be the joint distribution
function of a pair of random variables (see Problem 1).

FIGURE P.2.6.3

4. Let $R_1$ and $R_2$ have the following joint density:

\[
f_{12}(x, y) = \begin{cases} \tfrac{1}{4} & \text{if } -1 \le x \le 1 \text{ and } -1 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases}
\]

(This corresponds to $R_1$ and $R_2$ being chosen independently, each uniformly
distributed between −1 and +1; we elaborate on this in the next section.)
Find the probability of each of the following events.
(a) {R, + Re <3} 
(b) {Ry — Ry <3} 
(c) {RiR, <4} 
@ {FP <5 
R <2 
1 
(e) | < 3 
(f) {IRil + 1Ral < 1} 
(g) {|Rel < e®1} 


Ry 
Ry 


2.7 RELATIONSHIP BETWEEN JOINT AND INDIVIDUAL 
DENSITIES; INDEPENDENCE OF RANDOM VARIABLES 


If $R_1$ and $R_2$ are two random variables defined on the same probability
space, we wish to investigate the relation between the characterization of the
random variables individually and their characterization simultaneously.
We shall consider two problems.

1. If $(R_1, R_2)$ is absolutely continuous, are $R_1$ and $R_2$ absolutely continuous,
and, if so, how can the individual densities of $R_1$ and $R_2$ be found in terms of
the joint density?

2. Given $R_1$, $R_2$ (individually) absolutely continuous, is $(R_1, R_2)$ absolutely
continuous, and, if so, can the joint density be derived from the individual
densities?


Problem 1 


To go from simultaneous information to individual information is 
essentially a matter of adding across a row or column. For example, suppose 
that a group of 14 people has the age-weight distribution shown in Figure 
2.7.1. The number of people between 20 and 25 years is found by adding the 
numbers in the first column; thus 4 + 2 = 6. 

    weight
     170 +-----+-----+
         |  2  |  3  |
     160 +-----+-----+
         |  4  |  5  |
     150 +-----+-----+  age
         20    25    30

FIGURE 2.7.1 Calculation of Individual Probabilities from Joint Probabilities.

Let us develop this idea a bit further. If $R_1$ and $R_2$ are discrete, the joint
probability function of $R_1$ and $R_2$ [or the probability function of the pair
$(R_1, R_2)$] is defined by

\[
p_{12}(x, y) = P\{R_1 = x, R_2 = y\}, \qquad x, y \text{ real} \tag{2.7.1}
\]

If the possible values of $R_2$ are $y_1, y_2, \ldots$, then

\[
\{R_1 = x\} = \{R_1 = x, R_2 = y_1\} \cup \{R_1 = x, R_2 = y_2\} \cup \cdots
\]

since the events $\{R_2 = y_n\}$, $n = 1, 2, \ldots$ are mutually exclusive and exhaustive.
Thus the probability function of $R_1$ is given by

\[
p_1(x) = P\{R_1 = x\} = \sum_{y} p_{12}(x, y) \tag{2.7.2}
\]

Similarly,

\[
p_2(y) = P\{R_2 = y\} = \sum_{x} p_{12}(x, y) \tag{2.7.3}
\]

There are analogous formulas in higher dimensions, for example,

\[
p_{12}(x, y) = \sum_{z} p_{123}(x, y, z), \qquad p_2(y) = \sum_{x, z} p_{123}(x, y, z)
\]

where $p_{123}(x, y, z) = P\{R_1 = x, R_2 = y, R_3 = z\}$.
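The "adding across a row or column" rule of (2.7.2) and (2.7.3) can be sketched in a few lines of Python. The joint probability function `p12` below is a hypothetical example chosen only for illustration, and the helper name `marginals` is likewise an assumption.

```python
from collections import defaultdict
from fractions import Fraction as Fr

# Hypothetical joint probability function p12(x, y) = P{R1 = x, R2 = y}.
p12 = {
    (0, 0): Fr(1, 8), (0, 1): Fr(1, 4), (0, 2): Fr(1, 8),
    (1, 0): Fr(1, 4), (1, 1): Fr(1, 8), (1, 2): Fr(1, 8),
}

def marginals(joint):
    """Return (p1, p2), the individual probability functions of R1 and R2."""
    p1, p2 = defaultdict(Fr), defaultdict(Fr)
    for (x, y), p in joint.items():
        p1[x] += p   # (2.7.2): sum over y for fixed x
        p2[y] += p   # (2.7.3): sum over x for fixed y
    return dict(p1), dict(p2)

p1, p2 = marginals(p12)
print(p1)
print(p2)
```

Exact fractions are used so that the sums can be compared without rounding error.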

Now let us return to the absolutely continuous case. If $(R_1, R_2)$ is absolutely
continuous with joint density $f_{12}$, we shall show that $R_1$ is absolutely
continuous (and so is $R_2$) and find $f_1$ and $f_2$ in terms of $f_{12}$.

For any $x_0$ we have, intuitively,

\[
P\{x_0 < R_1 \le x_0 + dx_0\} \approx f_1(x_0)\,dx_0 \tag{2.7.4}
\]

But

\[
P\{x_0 < R_1 \le x_0 + dx_0\} = P\{x_0 < R_1 \le x_0 + dx_0,\ -\infty < R_2 < \infty\}
= \int_{x_0}^{x_0 + dx_0} dx \int_{-\infty}^{\infty} f_{12}(x, y)\,dy
\]

(see Figure 2.7.2).
If $f_{12}$ is well-behaved, this is approximately

\[
dx_0 \int_{-\infty}^{\infty} f_{12}(x_0, y)\,dy \tag{2.7.5}
\]

FIGURE 2.7.2 Calculation of Individual Densities from Joint Densities.

From (2.7.4) and (2.7.5) (replacing $x_0$ by $x$) we have

\[
f_1(x) = \int_{-\infty}^{\infty} f_{12}(x, y)\,dy
\]
To verify this formally, we work with the distribution function of $R_1$.

\[
F_1(x_0) = P\{R_1 \le x_0\} = P\{R_1 \le x_0,\ -\infty < R_2 < \infty\}
= \int_{x=-\infty}^{x_0} \left[ \int_{y=-\infty}^{\infty} f_{12}(x, y)\,dy \right] dx
\]

Thus $F_1$ is represented as an integral, and so $R_1$ is absolutely continuous with
density

\[
f_1(x) = \int_{-\infty}^{\infty} f_{12}(x, y)\,dy \tag{2.7.6}
\]

Similarly,

\[
f_2(y) = \int_{-\infty}^{\infty} f_{12}(x, y)\,dx \tag{2.7.7}
\]

In exactly the same way we may establish similar formulas in higher dimensions;
for example,

\[
f_{12}(x, y) = \int_{-\infty}^{\infty} f_{123}(x, y, z)\,dz \tag{2.7.8}
\]
\[
f_2(y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{123}(x, y, z)\,dx\,dz \tag{2.7.9}
\]

The process of obtaining the individual densities from the joint density is
sometimes called the calculation of marginal densities, because of the similarity
to the process of adding across a row or column.


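FIGURE 2.7.3

▶ Example 1. Let

\[
f_{12}(x, y) = \begin{cases} 8xy, & 0 \le y \le x \le 1 \\ 0 & \text{elsewhere} \end{cases}
\]

(see Figure 2.7.3).

\[
f_1(x) = \int_{-\infty}^{\infty} f_{12}(x, y)\,dy = 0 \quad \text{if } x < 0 \text{ or } x > 1
\]

If $0 \le x \le 1$,

\[
f_1(x) = \int_0^{x} 8xy\,dy = 4x^3 \qquad \text{(Figure 2.7.4a)}
\]

\[
f_2(y) = \int_{-\infty}^{\infty} f_{12}(x, y)\,dx = 0 \quad \text{if } y < 0 \text{ or } y > 1
\]

If $0 \le y \le 1$,

\[
f_2(y) = \int_y^{1} 8xy\,dx = 4y(1 - y^2) \qquad \text{(Figure 2.7.4b)}
\]

Sketches of $f_1$ and $f_2$ are given in Figure 2.7.5. ◀

FIGURE 2.7.4

FIGURE 2.7.5 [sketches of $f_1(x) = 4x^3$ and $f_2(y) = 4y(1 - y^2)$]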
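The marginal densities of Example 1 can be checked numerically by integrating out one variable at a time, as in (2.7.6) and (2.7.7). The sketch below uses a simple midpoint rule; the helper `integrate` and its step count are assumptions for illustration.

```python
def f12(x, y):
    """Joint density of Example 1: 8xy on 0 <= y <= x <= 1, zero elsewhere."""
    return 8.0 * x * y if 0.0 <= y <= x <= 1.0 else 0.0

def integrate(g, a, b, steps=20000):
    """Midpoint rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def f1(x):
    # (2.7.6): integrate the joint density over y
    return integrate(lambda y: f12(x, y), 0.0, 1.0)

def f2(y):
    # (2.7.7): integrate the joint density over x
    return integrate(lambda x: f12(x, y), 0.0, 1.0)

for t in (0.25, 0.5, 0.75):
    # numerical marginal next to the closed form from the example
    print(t, f1(t), 4 * t**3, f2(t), 4 * t * (1 - t**2))
```

Each numerical value should sit close to its closed form; the small discrepancy is due to the jump of the integrand at $y = x$.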


Problem 2 


The second problem posed at the beginning of this section has a negative
answer; that is, if $R_1$ and $R_2$ are each absolutely continuous then $(R_1, R_2)$
is not necessarily absolutely continuous. Furthermore, even if $(R_1, R_2)$ is
absolutely continuous, $f_1(x)$ and $f_2(y)$ do not determine $f_{12}(x, y)$. We give
examples later in the section.

However, there is an affirmative answer when the random variables are
independent. We have considered the notion of independence of events, and
this can be used to define independence of random variables. Intuitively, the
random variables $R_1, \ldots, R_n$ are independent if knowledge about some of
the $R_i$ does not change the odds about the other $R_i$'s. In other words, if
$A_i$ is an event involving $R_i$ alone, that is, if $A_i = \{R_i \in B_i\}$, then the events
$A_1, \ldots, A_n$ should be independent. Formally, we define independence as
follows.

DEFINITION. Let $R_1, \ldots, R_n$ be random variables on $(\Omega, \mathcal{F}, P)$. $R_1, \ldots, R_n$
are said to be independent iff for all Borel subsets $B_1, \ldots, B_n$ of $E^1$
we have

\[
P\{R_1 \in B_1, \ldots, R_n \in B_n\} = P\{R_1 \in B_1\} \cdots P\{R_n \in B_n\}
\]

REMARK. If $R_1, \ldots, R_n$ are independent, so are $R_1, \ldots, R_k$ for $k < n$.
For

\[
P\{R_1 \in B_1, \ldots, R_k \in B_k\} = P\{R_1 \in B_1, \ldots, R_k \in B_k,\ -\infty < R_{k+1} < \infty, \ldots, -\infty < R_n < \infty\}
= P\{R_1 \in B_1\} \cdots P\{R_k \in B_k\}
\]

since $P\{-\infty < R_i < \infty\} = 1$. If $(R_i,\ i \in \text{the index set } I)$ is an
arbitrary family of random variables on the space $(\Omega, \mathcal{F}, P)$, the
$R_i$ are said to be independent iff for each finite set of distinct indices
$i_1, \ldots, i_k \in I$, $R_{i_1}, \ldots, R_{i_k}$ are independent.

We may now give the solution to Problem 2 under the hypothesis of
independence.


Theorem 1. Let $R_1, R_2, \ldots, R_n$ be independent random variables on a
given probability space. If each $R_i$ is absolutely continuous with density $f_i$,
then $(R_1, R_2, \ldots, R_n)$ is absolutely continuous; also, for all $x_1, \ldots, x_n$,

\[
f_{12\cdots n}(x_1, x_2, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n)
\]

Thus in this sense the joint density is the product of the individual densities.

PROOF. The joint distribution function of $R_1, \ldots, R_n$ is given by

\[
F_{12\cdots n}(x_1, \ldots, x_n) = P\{R_1 \le x_1, \ldots, R_n \le x_n\}
= P\{R_1 \le x_1\} \cdots P\{R_n \le x_n\} \quad \text{by independence}
\]
\[
= \int_{-\infty}^{x_1} f_1(u_1)\,du_1 \cdots \int_{-\infty}^{x_n} f_n(u_n)\,du_n
= \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_1(u_1) \cdots f_n(u_n)\,du_1 \cdots du_n
\]

It follows from the definition of absolute continuity [see (2.6.2)] that $(R_1, \ldots, R_n)$
is absolutely continuous and that the joint density is
$f_{12\cdots n}(x_1, \ldots, x_n) = f_1(x_1) \cdots f_n(x_n)$.


Note that we have the following intuitive interpretation (when n = 2).
From the independence of $R_1$ and $R_2$ we obtain

\[
P\{x < R_1 \le x + dx,\ y < R_2 \le y + dy\} = P\{x < R_1 \le x + dx\}\,P\{y < R_2 \le y + dy\}
\]

If there is a joint density, we have (roughly) $f_{12}(x, y)\,dx\,dy = f_1(x)\,dx\,f_2(y)\,dy$,
so that $f_{12}(x, y) = f_1(x) f_2(y)$.

As a consequence of this result, the statement "Let $R_1, \ldots, R_n$ be independent
random variables, with $R_i$ having density $f_i$," is unambiguous in the
sense that it completely determines all probabilities of events involving the
random vector $(R_1, \ldots, R_n)$; if $B$ is an n-dimensional Borel set,

\[
P\{(R_1, \ldots, R_n) \in B\} = \int \cdots \int_B f_1(x_1) \cdots f_n(x_n)\,dx_1 \cdots dx_n
\]

We now show that Problem 2 has a negative answer when the hypothesis
of independence is dropped. We have seen that if $(R_1, \ldots, R_n)$ is absolutely
continuous then each $R_i$ is absolutely continuous, but the converse is false
in general if the $R_i$ are not independent; that is, each of the random variables
$R_1, \ldots, R_n$ can have a density without there being a density for the
n-tuple $(R_1, \ldots, R_n)$.


▶ Example 2. Let $R_1$ be an absolutely continuous random variable with
density $f$, and take $R_2 = R_1$; that is, $R_2(\omega) = R_1(\omega)$, $\omega \in \Omega$. Then $R_2$ is
absolutely continuous, but $(R_1, R_2)$ is not. For suppose that $(R_1, R_2)$ has a
density $g$. Necessarily $(R_1, R_2) \in L$, where $L$ is the line $y = x$, but

\[
P\{(R_1, R_2) \in L\} = \iint_L g(x, y)\,dx\,dy
\]

Since $L$ has area 0, the integral on the right is 0. But the probability on the
left is 1, a contradiction. ◀


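We can also give an example to show that if $R_1$ and $R_2$ are each absolutely
continuous (but not necessarily independent), then even if $(R_1, R_2)$ is absolutely
continuous, the joint density is not determined by the individual
densities.

▶ Example 3. Let

\[
f_{12}(x, y) = \begin{cases} \tfrac{1}{4}(1 + xy), & -1 \le x \le 1,\ -1 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases}
\]

Since

\[
\int_{-1}^{1} x\,dx = \int_{-1}^{1} y\,dy = 0,
\]

\[
f_1(x) = \int_{-\infty}^{\infty} f_{12}(x, y)\,dy = \tfrac{1}{2}, \quad -1 \le x \le 1; \qquad = 0 \text{ elsewhere}
\]
\[
f_2(y) = \tfrac{1}{2}, \quad -1 \le y \le 1; \qquad = 0 \text{ elsewhere}
\]

But if

\[
f_{12}(x, y) = \begin{cases} \tfrac{1}{4}, & -1 \le x \le 1,\ -1 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases}
\]

we get the same individual densities. ◀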
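Example 3 can be confirmed numerically: two different joint densities on the square $[-1, 1]^2$ produce the same pair of marginals. The sketch below is only a check under assumed helper names (`integrate`, `marginal_x`, `marginal_y`).

```python
def integrate(g, a, b, steps=1000):
    """Midpoint rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def in_square(x, y):
    return -1.0 <= x <= 1.0 and -1.0 <= y <= 1.0

def joint_a(x, y):
    """First density of Example 3: (1 + xy)/4 on the square."""
    return 0.25 * (1.0 + x * y) if in_square(x, y) else 0.0

def joint_b(x, y):
    """Second density of Example 3: constant 1/4 on the square."""
    return 0.25 if in_square(x, y) else 0.0

def marginal_x(joint, x):
    return integrate(lambda y: joint(x, y), -1.0, 1.0)

def marginal_y(joint, y):
    return integrate(lambda x: joint(x, y), -1.0, 1.0)

for t in (-0.8, 0.0, 0.6):
    print(t, marginal_x(joint_a, t), marginal_x(joint_b, t),
          marginal_y(joint_a, t), marginal_y(joint_b, t))
```

All four columns should be close to 1/2 for every $t$ in $[-1, 1]$, even though `joint_a` and `joint_b` differ.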


FIGURE 2.7.6

Now intuitively, if $R_1$ and $R_2$ are independent, then, say, $e^{R_1}$ and $\sin R_2$
should be independent, since information about $e^{R_1}$ should not change the
odds concerning $R_2$ and hence should not affect $\sin R_2$ either. We shall prove
a theorem of this type, but first we need some additional terminology.

If $g$ is a function that maps points in the set $D$ into points in the set $E$,†
and $T \subset E$, we define the preimage of $T$ under $g$ as

\[
g^{-1}(T) = \{x \in D : g(x) \in T\}
\]

For example, let $D = \{x_1, x_2, x_3, x_4\}$, $E = \{a, b, c\}$, $g(x_1) = g(x_2) = g(x_3) = a$,
$g(x_4) = c$ (see Figure 2.7.6). We then have

\[
g^{-1}\{a\} = \{x_1, x_2, x_3\}
\]
\[
g^{-1}\{a, b\} = \{x_1, x_2, x_3\}
\]
\[
g^{-1}\{a, c\} = \{x_1, x_2, x_3, x_4\}
\]
\[
g^{-1}\{b\} = \emptyset
\]

Note that, by definition of preimage, $x \in g^{-1}(T)$ iff $g(x) \in T$.
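For a function on a finite set, the preimage can be computed directly from its definition. The following Python sketch reproduces the four preimages listed above; representing $g$ as a dictionary and the helper name `preimage` are assumptions made for illustration.

```python
def preimage(g, domain, target):
    """Return g^{-1}(T) = {x in D : g(x) in T} for a function g stored as a dict."""
    return {x for x in domain if g[x] in target}

D = {"x1", "x2", "x3", "x4"}
g = {"x1": "a", "x2": "a", "x3": "a", "x4": "c"}

print(sorted(preimage(g, D, {"a"})))        # ['x1', 'x2', 'x3']
print(sorted(preimage(g, D, {"a", "b"})))   # ['x1', 'x2', 'x3']
print(sorted(preimage(g, D, {"a", "c"})))   # ['x1', 'x2', 'x3', 'x4']
print(sorted(preimage(g, D, {"b"})))        # []
```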

Now let $R_1, \ldots, R_n$ be random variables on a given probability space, and
let $g_1, \ldots, g_n$ be functions of one variable, that is, functions from the reals
to the reals. Let $R_1' = g_1(R_1), \ldots, R_n' = g_n(R_n)$; that is, $R_i'(\omega) = g_i(R_i(\omega))$,
$\omega \in \Omega$. We assume that the $R_i'$ are also random variables; this will be the case
if the $g_i$ are continuous or piecewise continuous. Specifically, we have the
following result, which we shall use without proof.

If $g$ is a real-valued function defined on the reals, and $g$ is piecewise continuous,
then for each Borel set $B \subset E^1$, $g^{-1}(B)$ is also a Borel subset of $E^1$.
(A function with this property is said to be Borel measurable.)

Now we show that if $g_i$ is piecewise continuous or, more generally, Borel
measurable, $R_i'$ is a random variable. Let $B_i'$ be a Borel subset of $E^1$. Then

\[
R_i'^{-1}(B_i') = \{\omega : R_i'(\omega) \in B_i'\}
= \{\omega : g_i(R_i(\omega)) \in B_i'\}
= \{\omega : R_i(\omega) \in g_i^{-1}(B_i')\} \in \mathcal{F}
\]

since $g_i^{-1}(B_i')$ is a Borel set.

† A common notation for such a function is $g: D \to E$. It means simply that $g(x)$ is defined
and belongs to $E$ for each $x$ in $D$.

Similarly, if $g$ is a continuous real-valued function defined on $E^n$, then, for
each Borel set $B \subset E^1$, $g^{-1}(B)$ is a Borel subset of $E^n$. It follows that if
$R_1, \ldots, R_n$ are random variables, so is $g(R_1, \ldots, R_n)$.


Theorem 2. If $R_1, \ldots, R_n$ are independent, then $R_1', \ldots, R_n'$ are also
independent. (For short, "functions of independent random variables are
independent.")

PROOF. If $B_1', \ldots, B_n'$ are Borel subsets of $E^1$, then

\[
P\{R_1' \in B_1', \ldots, R_n' \in B_n'\} = P\{g_1(R_1) \in B_1', \ldots, g_n(R_n) \in B_n'\}
= P\{R_1 \in g_1^{-1}(B_1'), \ldots, R_n \in g_n^{-1}(B_n')\}
\]
\[
= \prod_{i=1}^{n} P\{R_i \in g_i^{-1}(B_i')\} \quad \text{by independence of the } R_i
\]
\[
= \prod_{i=1}^{n} P\{g_i(R_i) \in B_i'\} = \prod_{i=1}^{n} P\{R_i' \in B_i'\}
\]


PROBLEMS 


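1. Let $(R_1, R_2)$ have the following density function.

\[
f_{12}(x, y) = \begin{cases} 4xy & \text{if } 0 \le x \le 1,\ 0 \le y \le 1,\ x \ge y \\ 6x^2 & \text{if } 0 \le x \le 1,\ 0 \le y \le 1,\ x < y \\ 0 & \text{elsewhere} \end{cases}
\]

(a) Find the individual density functions $f_1$ and $f_2$.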
(b) If A = {R, < 3}, B = {R, < $}, find P(A VU B). 


2. If $(R_1, R_2)$ is absolutely continuous with

\[
f_{12}(x, y) = \begin{cases} 2e^{-(x+y)}, & 0 \le y \le x \\ 0 & \text{elsewhere} \end{cases}
\]

find $f_1(x)$ and $f_2(y)$.

3. Let $(R_1, R_2)$ be uniformly distributed over the parallelogram with vertices
(−1, 0), (0, 0), (1, 1), and (0, 1).
(a) Find and sketch the density functions of $R_1$ and $R_2$.
(b) A new random variable $R_3$ is defined by $R_3 = R_1 + R_2$. Show that $R_3$ is
absolutely continuous, and find and sketch its density.

4. If $R_1, R_2, \ldots, R_n$ are independent, show that the joint distribution function is
the product of the individual distribution functions; that is,

\[
F_{12\cdots n}(x_1, x_2, \ldots, x_n) = F_1(x_1) F_2(x_2) \cdots F_n(x_n) \quad \text{for all real } x_1, \ldots, x_n
\]

[Conversely, it can be shown that if $F_{12\cdots n}(x_1, \ldots, x_n) = F_1(x_1) \cdots F_n(x_n)$ for
all real $x_1, \ldots, x_n$, then $R_1, \ldots, R_n$ are independent.]

5. Show that a random variable R is independent of itself—in other words, R 
and R are independent—if and only if R is degenerate, that is, essentially constant 
(P{R = c} = 1 for some c). 

6. Under what conditions will R and sin R be independent? (Use Problem 5 and 
the result that functions of independent random variables are independent.) 

7. If $(R_1, \ldots, R_n)$ is absolutely continuous and $f_{12\cdots n}(x_1, \ldots, x_n) = f_1(x_1) \cdots f_n(x_n)$
for all $x_1, \ldots, x_n$, show that $R_1, \ldots, R_n$ are independent.

8. Let $(R_1, R_2)$ be absolutely continuous with density $f_{12}(x, y) = (x + y)/8$, $0 \le x \le 2$,
$0 \le y \le 2$; $f_{12}(x, y) = 0$ elsewhere.

(a) Find the probability that R,? + R, < 1. 

(b) Find the conditional probability that exactly one of the random variables 
R,, R, is <1, given that at least one of the random variables is <1. 

(c) Determine whether or not R, and R, are independent. 


2.8 FUNCTIONS OF MORE THAN 
ONE RANDOM VARIABLE 


We are now equipped to consider a wide variety of problems of the following
sort. If $R_1, \ldots, R_n$ are random variables with a given joint density, and we
define $R = g(R_1, \ldots, R_n)$, we ask for the distribution or density function
of $R$. We shall use a distribution function approach to these problems; that
is, we shall find the distribution function of $R$ directly. There is also a density
function method, but it is usually not as convenient; the density function
approach is outlined in Problem 12. The distribution function method can be
described as follows.

\[
F_R(z) = P\{R \le z\} = P\{g(R_1, \ldots, R_n) \le z\}
= \int \cdots \int_{g(x_1, \ldots, x_n) \le z} f_{12\cdots n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n
\]


▶ Example 1. Let $R_1$ and $R_2$ be uniformly distributed between 0 and 1,
and independent.
(a) Let $R_3 = R_1 + R_2$. Then, since $f_{12}(x, y) = f_1(x) f_2(y)$ by independence,

\[
F_3(z) = P\{R_1 + R_2 \le z\} = \iint_{x + y \le z} f_1(x) f_2(y)\,dx\,dy
\]

If $0 \le z \le 1$ (see Figure 2.8.1a),

\[
F_3(z) = \iint_{\text{shaded area}} 1\,dx\,dy = \text{shaded area} = \frac{z^2}{2}
\]

FIGURE 2.8.1 (a) Calculation of $F_3(z)$, $0 \le z \le 1$. (b) Calculation of $F_3(z)$, $1 \le z \le 2$.
(c) $f_3(z)$.

If $1 \le z \le 2$ (see Figure 2.8.1b),

\[
F_3(z) = \text{shaded area} = 1 - \frac{(2 - z)^2}{2}
\]

Thus $f_3(z) = z$, $0 \le z \le 1$; $f_3(z) = 2 - z$, $1 \le z \le 2$; $f_3(z) = 0$ elsewhere
(see Figure 2.8.1c).


(b) Let $R_3 = R_1 R_2$. (Notice that $0 \le R_3 \le 1$.) If $0 < z < 1$ (see Figure
2.8.2),

\[
F_3(z) = P\{R_1 R_2 \le z\} = \iint_{xy \le z} f_{12}(x, y)\,dx\,dy
= \text{shaded area} = z + \int_z^{1} \frac{z}{x}\,dx = z - z \ln z
\]

\[
f_3(z) = \begin{cases} -\ln z, & 0 < z < 1 \\ 0 & \text{elsewhere} \end{cases}
\]

FIGURE 2.8.2 [(a) shaded region; (b) sketch of $f_3(z)$]


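(c) Let $R_3 = \max(R_1, R_2)$. If $0 \le z \le 1$ (see Figure 2.8.3),

\[
P\{R_3 \le z\} = P\{R_1 \le z, R_2 \le z\} = \text{shaded area} = z^2
\]

[Alternatively, $F_3(z) = P\{R_1 \le z\}\,P\{R_2 \le z\} = z^2$ by independence.]

\[
f_3(z) = 2z, \quad 0 \le z \le 1; \qquad f_3(z) = 0 \text{ elsewhere} \quad ◀
\]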

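Before the next example, we introduce the Gaussian or normal density
function.

\[
f(x) = \frac{1}{\sqrt{2\pi}\,b}\, e^{-(x-a)^2/2b^2}, \qquad x \text{ real } (b > 0,\ a \text{ any real number}) \tag{2.8.1}
\]

This is the familiar bell-shaped curve centered at $a$ (Figure 2.8.4); the smaller
the value of $b$, the higher the peak and the more $f$ is concentrated close to
$x = a$. To check that this is a legitimate density, we must show that the area
under $f$ is 1. Let

\[
I = \int_{-\infty}^{\infty} e^{-x^2}\,dx
\]

FIGURE 2.8.3

[Figure 2.8.4: sketch of $f(x)$, with one curve labeled "Larger b"]

Then

\[
I^2 = \int_{-\infty}^{\infty} e^{-x^2}\,dx \int_{-\infty}^{\infty} e^{-y^2}\,dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)}\,dx\,dy
= \text{(in polar coordinates)} \int_0^{2\pi} d\theta \int_0^{\infty} r e^{-r^2}\,dr = \pi
\]

so that

\[
\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi} \tag{2.8.2}
\]

Thus

\[
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,b}\, e^{-(x-a)^2/2b^2}\,dx
= \left(\text{with } y = \frac{x - a}{\sqrt{2}\,b}\right) \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{-y^2}\,dy = 1
\]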
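Both (2.8.2) and the normalization of the density (2.8.1) can be checked numerically. The sketch below uses a midpoint rule on a finite interval; the truncation points and the helper names `integrate` and `normal_density` are assumptions for illustration.

```python
import math

def integrate(g, lo, hi, steps=20000):
    """Midpoint rule for the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

def normal_density(x, a=0.0, b=1.0):
    """Normal density (2.8.1), centered at a with parameter b > 0."""
    return math.exp(-((x - a) ** 2) / (2 * b * b)) / (math.sqrt(2 * math.pi) * b)

# (2.8.2): the integral of exp(-x^2), truncated to [-10, 10]
gauss_integral = integrate(lambda x: math.exp(-x * x), -10.0, 10.0)
print(gauss_integral, math.sqrt(math.pi))

# Area under the normal density for a few (a, b) pairs, truncated to a +/- 12b
areas = [integrate(lambda x: normal_density(x, a, b), a - 12 * b, a + 12 * b)
         for a, b in [(0.0, 1.0), (2.0, 0.5), (-1.0, 3.0)]]
print(areas)   # each entry should be close to 1
```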


▶ Example 2. Let $R_1$, $R_2$, and $R_3$ be independent, each normally distributed
(i.e., having the normal density), with $a = 0$, $b = 1$. Let $R_4 = (R_1^2 + R_2^2 + R_3^2)^{1/2}$;
take the positive square root so that $R_4 \ge 0$. (For example, if
$R_1$, $R_2$, and $R_3$ are the velocity components of a particle, then $R_4$ is the speed.)
Find the distribution function of $R_4$.

\[
F_4(w) = P\{R_4 \le w\} = P\{R_1^2 + R_2^2 + R_3^2 \le w^2\}
= \iiint_{x^2 + y^2 + z^2 \le w^2} (2\pi)^{-3/2} e^{-(x^2 + y^2 + z^2)/2}\,dx\,dy\,dz
\]

We switch to spherical coordinates:

\[
x = r \sin\phi \cos\theta, \qquad y = r \sin\phi \sin\theta, \qquad z = r \cos\phi
\]

($\phi$ is the "cone angle," and $\theta$ the "polar coordinate angle.") Then

\[
F_4(w) = \int_0^{2\pi} d\theta \int_0^{\pi} d\phi \int_0^{w} (2\pi)^{-3/2} e^{-r^2/2}\, r^2 \sin\phi\,dr
= (2\pi)^{-3/2} (2\pi)(2) \int_0^{w} r^2 e^{-r^2/2}\,dr
\]

Thus $R_4$ has a density given by

\[
f_4(w) = \begin{cases} \dfrac{\sqrt{2}}{\sqrt{\pi}}\, w^2 e^{-w^2/2}, & w \ge 0 \\ 0, & w < 0 \end{cases} \quad ◀
\]
▶ Example 3. There are certain situations in which it is possible to avoid
all integration in an n-dimensional problem. Suppose that $R_1, \ldots, R_n$ are
independent and $F_i$ is the distribution function of $R_i$, $i = 1, 2, \ldots, n$.

Let $T_k$ be the kth smallest of the $R_i$. [For example, if $n = 4$ and $R_1(\omega) = 3$,
$R_2(\omega) = 1.5$, $R_3(\omega) = -10$, $R_4(\omega) = 7$, then

\[
T_1(\omega) = \min_i R_i(\omega) = R_3(\omega) = -10, \qquad T_2(\omega) = R_2(\omega) = 1.5
\]
\[
T_3(\omega) = R_1(\omega) = 3, \qquad T_4(\omega) = \max_i R_i(\omega) = R_4(\omega) = 7]
\]

(Ties may be broken, for example, by favoring the random variable with the
smaller subscript.)

We wish to find the distribution function of $T_k$. When $k = 1$ or $k = n$,
the calculation is brief.

\[
P\{T_n \le x\} = P\{\max(R_1, \ldots, R_n) \le x\} = P\{R_1 \le x, \ldots, R_n \le x\}
= \prod_{i=1}^{n} P\{R_i \le x\} \quad \text{by independence}
\]

Thus

\[
F_{T_n}(x) = \prod_{i=1}^{n} F_i(x)
\]

\[
P\{T_1 \le x\} = 1 - P\{T_1 > x\} = 1 - P\{\min(R_1, \ldots, R_n) > x\}
= 1 - P\{R_1 > x, \ldots, R_n > x\} = 1 - \prod_{i=1}^{n} P\{R_i > x\}
\]

Thus

\[
F_{T_1}(x) = 1 - \prod_{i=1}^{n} (1 - F_i(x))
\]

REMARK. We may also calculate $F_{T_1}(x)$ as follows.

\[
P\{T_1 \le x\} = P\{\text{at least one } R_i \text{ is} \le x\}
= P(A_1 \cup A_2 \cup \cdots \cup A_n) \quad \text{where } A_i = \{R_i \le x\}
\]
\[
= P(A_1) + P(A_1^c \cap A_2) + \cdots + P(A_1^c \cap \cdots \cap A_{n-1}^c \cap A_n) \quad \text{by (1.3.11)}
\]

But

\[
P(A_1^c \cap \cdots \cap A_{i-1}^c \cap A_i) = P\{R_1 > x, \ldots, R_{i-1} > x, R_i \le x\}
= (1 - F_1(x)) \cdots (1 - F_{i-1}(x)) F_i(x)
\]

Thus

\[
F_{T_1}(x) = F_1(x) + (1 - F_1(x)) F_2(x) + (1 - F_1(x))(1 - F_2(x)) F_3(x)
+ \cdots + (1 - F_1(x)) \cdots (1 - F_{n-1}(x)) F_n(x)
\]

Hence

\[
1 - F_{T_1} = (1 - F_1) - (1 - F_1) F_2 - (1 - F_1)(1 - F_2) F_3 - \cdots
\]
\[
= (1 - F_1)(1 - F_2) - (1 - F_1)(1 - F_2) F_3 - \cdots
\]
\[
= (1 - F_1)(1 - F_2)(1 - F_3) - (1 - F_1)(1 - F_2)(1 - F_3) F_4 - \cdots
\]
\[
= \prod_{i=1}^{n} (1 - F_i)
\]

as above.


We now make the simplifying assumption that the $R_i$ are absolutely continuous
(as well as independent), each with the same density $f$. [Note that

\[
P\{R_i = R_j\} = \iint_{x_i = x_j} f(x_i) f(x_j)\,dx_i\,dx_j = 0 \quad (\text{if } i \ne j)
\]

Hence

\[
P\{R_i = R_j \text{ for at least one } i \ne j\} \le \sum_{i \ne j} P\{R_i = R_j\} = 0
\]

Thus ties occur with probability zero and can be ignored.]

We shall show that the $T_k$ are absolutely continuous, and find the density
explicitly. We do this intuitively first. We have

\[
P\{x < T_k \le x + dx\} = P\{x < T_k \le x + dx,\ T_k = R_1\}
+ P\{x < T_k \le x + dx,\ T_k = R_2\} + \cdots + P\{x < T_k \le x + dx,\ T_k = R_n\}
\]

by the theorem of total probability. (The events $\{T_k = R_i\}$, $i = 1, \ldots, n$,
are mutually exclusive and exhaustive.) Thus

\[
P\{x < T_k \le x + dx\} = n P\{x < T_k \le x + dx,\ T_k = R_1\} \quad \text{by symmetry}
\]
\[
= n P\{T_k = R_1,\ x < R_1 \le x + dx\}
\]

Now for $R_1$ to be the kth smallest and fall between $x$ and $x + dx$, exactly
$k - 1$ of the random variables $R_2, \ldots, R_n$ must be $< R_1$, and the remaining
$n - k$ must be $> R_1$ [and $R_1$ must lie in $(x, x + dx)$].

Since there are $\binom{n-1}{k-1}$ ways of selecting $k - 1$ distinct objects out of $n - 1$,
we have

\[
P\{x < T_k \le x + dx\} = n \binom{n-1}{k-1} P\{x < R_1 \le x + dx,\ R_2 < R_1, \ldots, R_k < R_1,\ R_{k+1} > R_1, \ldots, R_n > R_1\}
\]

But if $R_1$ falls in $(x, x + dx)$, $R_i < R_1$ is essentially the same thing as $R_i < x$,
so that

\[
P\{x < T_k \le x + dx\} = n \binom{n-1}{k-1} f(x)\,dx\, (P\{R_i < x\})^{k-1} (P\{R_i > x\})^{n-k}
= n \binom{n-1}{k-1} f(x) (F(x))^{k-1} (1 - F(x))^{n-k}\,dx
\]

Since $P\{x < T_k \le x + dx\} = f_k(x)\,dx$, where $f_k$ is the density of $T_k$ (assumed
to exist), we have

\[
f_k(x) = n \binom{n-1}{k-1} f(x) (F(x))^{k-1} (1 - F(x))^{n-k}
\]

[When $k = n$ we get $n f(x) (F(x))^{n-1} = (d/dx) F(x)^n$, and when $k = 1$ we get
$n f(x) (1 - F(x))^{n-1} = (d/dx)(1 - (1 - F(x))^n)$, in agreement with the
previous results if all $R_i$ have distribution function $F$ and the density $f$ can
be obtained by differentiating $F$.]

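To obtain the result formally, we reason as follows.

\[
P\{T_k \le x\} = \sum_{i=1}^{n} P\{T_k \le x,\ T_k = R_i\} = n P\{T_k \le x,\ T_k = R_1\}
\]
\[
= n P\{R_1 \le x, \text{ exactly } k - 1 \text{ of the variables } R_2, \ldots, R_n \text{ are } < R_1, \text{ and the remaining } n - k \text{ variables are } > R_1\}
\]
\[
= n \binom{n-1}{k-1} P\{R_1 \le x,\ R_2 < R_1, \ldots, R_k < R_1,\ R_{k+1} > R_1, \ldots, R_n > R_1\} \quad \text{by symmetry}
\]
\[
= n \binom{n-1}{k-1} \int_{x_1=-\infty}^{x} \int_{x_2=-\infty}^{x_1} \cdots \int_{x_k=-\infty}^{x_1} \int_{x_{k+1}=x_1}^{\infty} \cdots \int_{x_n=x_1}^{\infty} f(x_1) f(x_2) \cdots f(x_n)\,dx_1 \cdots dx_n
\]
\[
= \int_{-\infty}^{x} n \binom{n-1}{k-1} f(x_1) (F(x_1))^{k-1} (1 - F(x_1))^{n-k}\,dx_1
\]

The integrand is the density of $T_k$, in agreement with the intuitive approach.
$T_1, \ldots, T_n$ are called the order statistics of $R_1, \ldots, R_n$.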


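REMARK. All events

\[
\{R_{i_1} \le x,\ R_{i_2} < R_{i_1}, \ldots, R_{i_k} < R_{i_1},\ R_{i_{k+1}} > R_{i_1}, \ldots, R_{i_n} > R_{i_1}\}
\]

have the same probability, namely,

\[
\int_{-\infty}^{x} f(x_{i_1})\,dx_{i_1} \int_{-\infty}^{x_{i_1}} f(x_{i_2})\,dx_{i_2} \cdots \int_{-\infty}^{x_{i_1}} f(x_{i_k})\,dx_{i_k}
\int_{x_{i_1}}^{\infty} f(x_{i_{k+1}})\,dx_{i_{k+1}} \cdots \int_{x_{i_1}}^{\infty} f(x_{i_n})\,dx_{i_n}
\]

This justifies the appeal to symmetry in the above argument. ◀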
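The order-statistic density can be compared with a simulation. The sketch below does this for uniform(0, 1) variables, where $f = 1$ and $F(x) = x$ on $[0, 1]$; the choice $n = 5$, $k = 2$, $x = 0.3$, the fixed seed, and the helper names are all assumptions made for illustration.

```python
import math
import random

def order_stat_density(x, k, n, f, F):
    """n * C(n-1, k-1) * f(x) * F(x)^(k-1) * (1 - F(x))^(n-k): density of the
    k-th smallest of n independent variables with common density f and c.d.f. F."""
    return n * math.comb(n - 1, k - 1) * f(x) * F(x) ** (k - 1) * (1 - F(x)) ** (n - k)

def order_stat_cdf(x, k, n, f, F, lo, steps=2000):
    """Midpoint-rule integral of the order-statistic density from lo to x."""
    h = (x - lo) / steps
    return h * sum(order_stat_density(lo + (i + 0.5) * h, k, n, f, F) for i in range(steps))

f = lambda x: 1.0      # uniform(0, 1) density on [0, 1]
F = lambda x: x        # uniform(0, 1) distribution function on [0, 1]
n, k, x = 5, 2, 0.3

cdf_formula = order_stat_cdf(x, k, n, f, F, lo=0.0)

rng = random.Random(0)          # fixed seed so that the run is repeatable
trials = 20000
hits = sum(sorted(rng.random() for _ in range(n))[k - 1] <= x for _ in range(trials))
empirical = hits / trials

print(cdf_formula)   # P{T_2 <= 0.3} from the density formula
print(empirical)     # observed frequency in the simulation
```

The two printed numbers should be close; the simulated frequency fluctuates by an amount of order $1/\sqrt{\text{trials}}$.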


PROBLEMS 


1. Let $R_1$ and $R_2$ be independent and uniformly distributed between 0 and 1.
Find and sketch the distribution or density function of the random variable
$R_3 = R_1/R_2^2$.

2. If $R_1$ and $R_2$ are independent random variables, each with the density function
$f(x) = e^{-x}$, $x \ge 0$; $f(x) = 0$, $x < 0$, find and sketch the distribution or density
function of the random variable $R_3$, where
(a) $R_3 = R_1 + R_2$
(b) $R_3 = R_2/R_1$

3. Let $R_1$ and $R_2$ be independent, absolutely continuous random variables, each
normally distributed with parameters $a = 0$ and $b = 1$; that is,

\[
f_1(x) = f_2(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}
\]

Find and sketch the density or distribution function of the random variable
$R_3 = R_2/R_1$.

4. Let $R_1$ and $R_2$ be independent, absolutely continuous random variables,
each uniformly distributed between 0 and 1. Find and sketch the distribution
or density function of the random variable $R_3$, where

\[
R_3 = \frac{\max(R_1, R_2)}{\min(R_1, R_2)}
\]

REMARK. The example in which $R_3 = \max(R_1, R_2) + \min(R_1, R_2)$ may
occur to the reader. However, this yields nothing new, since
$\max(R_1, R_2) + \min(R_1, R_2) = R_1 + R_2$ (the sum of two numbers is
the larger plus the smaller).


5. A point-size worm is inside an apple in the form of the sphere $x^2 + y^2 + z^2 = 4a^2$.
(Its position is uniformly distributed.) If the apple is eaten down to a core
determined by the intersection of the sphere and the cylinder $x^2 + y^2 = a^2$,
find the probability that the worm will be eaten.


6. A point $(R_1, R_2, R_3)$ is uniformly distributed over the region in $E^3$ described


by 2 + y? < 4,0 <z < 3w. Find the probability that R, < 2R,. 


7. Solve Problem 6 under the assumption that $(R_1, R_2, R_3)$ has density $f(x, y, z) = kz$
over the given region and $f(x, y, z) = 0$ outside the region.


8. Let $T_1, \ldots, T_n$ be the order statistics of $R_1, \ldots, R_n$, where $R_1, \ldots, R_n$ are
independent, each with density $f$. Show that the joint density of $T_1, \ldots, T_n$ is
given by

\[
g(x_1, \ldots, x_n) = \begin{cases} n!\, f(x_1) \cdots f(x_n), & x_1 < x_2 < \cdots < x_n \\ 0 & \text{elsewhere} \end{cases}
\]

HINT: Find $P\{T_1 \le b_1, \ldots, T_n \le b_n,\ R_1 < R_2 < \cdots < R_n\}$.


9. Let $R_1$, $R_2$, and $R_3$ be independent, each with density

\[
f(x) = \begin{cases} e^{-x}, & x \ge 0 \\ 0, & x < 0 \end{cases}
\]

Find the probability that $R_1 \ge 2R_2 \ge 3R_3$.


10. A man and a woman agree to meet at a certain place some time between 11 and
12 o'clock. They agree that the one arriving first will wait $z$ hours, $0 \le z \le 1$,
for the other to arrive. Assuming that the arrival times are independent and
uniformly distributed, find the probability that they will meet.


11. If n points $R_1, \ldots, R_n$ are picked independently and with uniform density on a
straight line of length $L$, find the probability that no two points will be less
than distance $d$ apart; that is, find

\[
P\{\min_{i \ne j} |R_i - R_j| \ge d\}
\]

HINT: First find $P\{\min_{i \ne j} |R_i - R_j| \ge d,\ R_1 < R_2 < \cdots < R_n\}$; show that the
region of integration defined by this event is

\[
\begin{aligned}
x_{n-1} + d &\le x_n \le L \\
x_{n-2} + d &\le x_{n-1} \le L - d \\
x_{n-3} + d &\le x_{n-2} \le L - 2d \\
&\ \ \vdots \\
x_1 + d &\le x_2 \le L - (n - 2)d \\
0 &\le x_1 \le L - (n - 1)d
\end{aligned}
\]

12. (The density function method for functions of more than one random variable.)
Let $(R_1, \ldots, R_n)$ be absolutely continuous with density $f_{12\cdots n}(x_1, \ldots, x_n)$.
Define random variables $W_1, \ldots, W_n$ by $W_i = g_i(R_1, \ldots, R_n)$, $i = 1, 2, \ldots, n$;
thus $(W_1, \ldots, W_n) = g(R_1, \ldots, R_n)$. Assume that $g$ is one-to-one, continuously
differentiable with a nonzero Jacobian $J_g$ (hence $g$ has a continuously
differentiable inverse $h$). Show that $(W_1, \ldots, W_n)$ is absolutely continuous
with density

\[
f^*_{12\cdots n}(y) = f_{12\cdots n}(h(y))\,|J_h(y)|, \qquad y = (y_1, \ldots, y_n)
\]
\[
= \frac{f_{12\cdots n}(h(y))}{|J_g(x)|_{x = h(y)}}
\]

[The result is the same if $g$ is defined only on some open subset $D$ of $E^n$ and
$P\{(R_1, \ldots, R_n) \in D\} = 1$.]


13. Let $R_1$ and $R_2$ be independent random variables, each normally distributed
with $a = 0$ and the same $b$. Define random variables $R_0$ and $\theta_0$ by

\[
R_1 = R_0 \cos\theta_0 \quad (\text{taking } R_0 \ge 0)
\]
\[
R_2 = R_0 \sin\theta_0
\]

Show that $R_0$ and $\theta_0$ are independent, and find their density functions.


14. Let $R_1$ and $R_2$ be independent, absolutely continuous, positive random variables
and let $R_3 = R_1 R_2$. Show that the density function of $R_3$ is given by

\[
f_3(z) = \begin{cases} \displaystyle\int_0^{\infty} \frac{1}{w}\, f_1\!\left(\frac{z}{w}\right) f_2(w)\,dw, & z > 0 \\ 0, & z \le 0 \end{cases}
\]

Note: This problem may be done by the distribution function method or by
applying Problem 12 as follows. Let

\[
R_3 = R_1 R_2, \qquad R_4 = R_2
\]

Use the results of Problem 12 to obtain $f_{34}(z, w)$ and from this find $f_3(z)$.


FIGURE P.2.8.15

15. Because of inefficiency of production, the resistances $R_1$ and $R_2$ in Figure
P.2.8.15 may be regarded as independent random variables, each uniformly
distributed between 0 and 1 ohm. Find the probability that the total resistance
R of the network is <4 ohm. 

16. A chamber consists of the inside of the cylinder $x^2 + y^2 = 1$. A particle at
the origin is given initial velocity components $v_x = R_1$ and $v_y = R_2$, where $R_1$
and $R_2$ are independent random variables, each with normal density
$f(x) = (2\pi)^{-1/2} e^{-x^2/2}$. (There is no motion in the z-direction, and no force acting on the
particle after the initial "push" at time $t = 0$.) If $T$ is the time at which the
particle strikes the wall of the chamber, find the distribution and density
functions of $T$.


2.9 SOME DISCRETE EXAMPLES 


In this section we examine some typical problems involving one or more 
discrete random variables. We first introduce the Poisson distribution, which 
may be regarded as an approximation to the binomial when the number n 
of trials is large and the probability p of success on a given trial is small. 

Let $R_n$ be the number of successes in $n$ Bernoulli trials, with probability
$p_n$ of success on a given trial. We have seen (Section 1.5) that $R_n$ has the
binomial distribution;† that is, the probability function of $R_n$ is

\[
p_{R_n}(k) = \binom{n}{k} p_n^k (1 - p_n)^{n-k}, \qquad k = 0, 1, \ldots, n
\]

† When a probabilist says he knows the distribution of a random variable $R$, he generally
means that he has some way of calculating $P\{R \in B\}$ for all Borel sets $B$. For example, he
might know the distribution function of $R$, or the probability function if $R$ is discrete, or
the density function if $R$ is absolutely continuous. Thus to say that $R$ has the normal
distribution means that $R$ has a density given by the formula (2.8.1).

We now let $n \to \infty$, $p_n \to 0$ in such a way that $np_n \to \lambda$ = constant. We
shall show that

\[
p_{R_n}(k) \to \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots
\]

To see this, write

\[
p_{R_n}(k) = \frac{n(n-1) \cdots (n-k+1)}{k!}\, p_n^k (1 - p_n)^{n-k}
= \frac{(1 - 1/n)(1 - 2/n) \cdots (1 - (k-1)/n)}{k!}\, (np_n)^k \left(1 - \frac{np_n}{n}\right)^{n-k}
\]

Now $(1 - np_n/n)^{-k} \to 1$ and $(1 - np_n/n)^n \to e^{-\lambda}$ (Problem 1), and the result
follows.

We call

\[
p(k) = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots \tag{2.9.1}
\]

the Poisson probability function; a random variable which has this probability
function is said to have the Poisson distribution. (To check that it is a
legitimate probability function:

\[
\sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda}\left(1 + \lambda + \frac{\lambda^2}{2!} + \cdots\right) = e^{-\lambda} e^{\lambda} = 1)
\]
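The convergence of the binomial probabilities to the Poisson probabilities can be observed numerically. In the sketch below we hold $np_n = \lambda$ fixed and let $n$ grow; the value $\lambda = 2$, the range of $k$, and the function names are assumptions chosen for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability function C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability function (2.9.1)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0
gaps = []
for n in (10, 100, 1000, 10000):
    p = lam / n   # keeps n * p_n equal to lambda
    gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))
    gaps.append(gap)
    print(n, gap)   # largest difference over k = 0, ..., 10

total = sum(poisson_pmf(k, lam) for k in range(60))
print(total)        # partial sum of (2.9.1); close to 1
```

The largest gap should shrink as $n$ increases, and the partial sum of the Poisson probabilities should be very close to 1.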


We shall show that if $R_1$ and $R_2$ are independent, each having the Poisson
distribution, then $R_1 + R_2$ also has the Poisson distribution. We first need a
characterization of independence in the discrete case.

Theorem 1. Let $R_1, \ldots, R_n$ be discrete random variables on a given
probability space, with probability functions $p_1, \ldots, p_n$. Let $p_{12\cdots n}$ be the
joint probability function of $R_1, \ldots, R_n$, defined by

\[
p_{12\cdots n}(x_1, \ldots, x_n) = P\{R_1 = x_1, \ldots, R_n = x_n\} \tag{2.9.2}
\]

Then $R_1, \ldots, R_n$ are independent if and only if

\[
p_{12\cdots n}(x_1, \ldots, x_n) = p_1(x_1) \cdots p_n(x_n) \quad \text{for all } x_1, \ldots, x_n
\]

PROOF. If $R_1, \ldots, R_n$ are independent, then

\[
p_{12\cdots n}(x_1, \ldots, x_n) = P\{R_1 = x_1, \ldots, R_n = x_n\}
= P\{R_1 = x_1\} \cdots P\{R_n = x_n\} \quad \text{by independence}
\]
\[
= p_1(x_1) \cdots p_n(x_n)
\]

Conversely, if $p_{12\cdots n}(x_1, \ldots, x_n) = p_1(x_1) \cdots p_n(x_n)$, then for all one-dimensional
Borel sets $B_1, \ldots, B_n$,

\[
P\{R_1 \in B_1, \ldots, R_n \in B_n\} = \sum_{x_1 \in B_1, \ldots, x_n \in B_n} P\{R_1 = x_1, \ldots, R_n = x_n\}
= \sum_{x_1 \in B_1, \ldots, x_n \in B_n} p_{12\cdots n}(x_1, \ldots, x_n)
\]
\[
= \sum_{x_1 \in B_1, \ldots, x_n \in B_n} p_1(x_1) \cdots p_n(x_n)
= \sum_{x_1 \in B_1} p_1(x_1) \cdots \sum_{x_n \in B_n} p_n(x_n)
= P\{R_1 \in B_1\} \cdots P\{R_n \in B_n\}
\]

Hence $R_1, \ldots, R_n$ are independent.
REMARKS. If $R_1$ and $R_2$ are not independent, the joint probability function of
$R_1$ and $R_2$ is not determined by the individual probability functions.
For example, if $P\{R_1 = 1, R_2 = 1\} = P\{R_1 = 2, R_2 = 2\} = a$,
$P\{R_1 = 1, R_2 = 2\} = P\{R_1 = 2, R_2 = 1\} = \tfrac{1}{2} - a$, $0 \le a \le \tfrac{1}{2}$, then
$P\{R_1 = 1\} = P\{R_1 = 2\} = \tfrac{1}{2}$ and $P\{R_2 = 1\} = P\{R_2 = 2\} = \tfrac{1}{2}$. Thus
we have uncountably many joint probability functions giving rise
to the same individual probability functions.

If we wish to define discrete random variables $R_1, \ldots, R_n$ having a specified
joint probability function $p_{12\cdots n}$, there is no difficulty in constructing an
appropriate probability space. Take $\Omega = E^n$, $\mathcal{F}$ = all subsets of $\Omega$ (since
the random variables are discrete, there is no need to restrict to Borel sets),
$P(B) = \sum_{(x_1, \ldots, x_n) \in B} p_{12\cdots n}(x_1, \ldots, x_n)$, $B \in \mathcal{F}$.

Now let $R_1$ and $R_2$ be independent, with $R_i$ having the Poisson distribution
with parameter $\lambda_i$, $i = 1, 2$. By Theorem 1, the joint probability function of
$R_1$ and $R_2$ is

\[
p_{12}(j, k) = P\{R_1 = j, R_2 = k\} = e^{-(\lambda_1 + \lambda_2)}\, \frac{\lambda_1^j \lambda_2^k}{j!\,k!}, \qquad j, k = 0, 1, \ldots
\]

We find the probability function of $R_1 + R_2$.

\[
P\{R_1 + R_2 = m\} = \sum_{j + k = m} p_{12}(j, k)
= e^{-(\lambda_1 + \lambda_2)} \sum_{j=0}^{m} \frac{\lambda_1^j \lambda_2^{m-j}}{j!\,(m - j)!}
\]

But $(\lambda_1 + \lambda_2)^m = \sum_{j=0}^{m} \binom{m}{j} \lambda_1^j \lambda_2^{m-j}$ by the binomial theorem, so that

\[
P\{R_1 + R_2 = m\} = \frac{e^{-(\lambda_1 + \lambda_2)} (\lambda_1 + \lambda_2)^m}{m!}, \qquad m = 0, 1, \ldots
\]

Thus $R_1 + R_2$ has the Poisson distribution with parameter $\lambda_1 + \lambda_2$.

By induction, it follows that the sum of n independent random variables
$R_1, \ldots, R_n$, where $R_i$ is Poisson with parameter $\lambda_i$, has the Poisson distribution
with parameter $\lambda_1 + \cdots + \lambda_n$.

The use of the Poisson distribution as an approximation to the binomial is
illustrated in the problems.
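The additivity of the Poisson distribution can be checked by carrying out the sum over $j + k = m$ numerically. The parameter values and function names in this sketch are assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability function (2.9.1)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def pmf_of_sum(m, lam1, lam2):
    """P{R1 + R2 = m}: sum of the product probabilities over all j + k = m."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(m - j, lam2) for j in range(m + 1))

lam1, lam2 = 1.5, 2.5
for m in range(6):
    # convolution sum next to the Poisson probability with parameter lam1 + lam2
    print(m, pmf_of_sum(m, lam1, lam2), poisson_pmf(m, lam1 + lam2))
```

The two columns should agree up to floating-point rounding.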


> Example 1. Six unbiased dice are tossed independently. Let R, be the 
number of ones, R, the number of twos; R, and R, have the binomial distri- 
bution with n = 6, p = 1/6; that is, 


aVeV ele 1, 2,3, 4, 5, 6 
k) = pk) = (°\(=\(2) > &=0,1,2,3,4,5, 
Pi(k) = Polk) (;.) (;) (;) 
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Let us find the joint probability function p,.(j, k) of R, and R,. This is a multi- 
nomial problem. 


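$b_1$ = "1 occurs" on a given toss      $p_1 = 1/6$    $n_1 = j$
$b_2$ = "2 occurs"                      $p_2 = 1/6$    $n_2 = k$              $n = 6$
$b_3$ = "3, 4, 5, or 6 occurs"          $p_3 = 2/3$    $n_3 = 6 - j - k$


Thus

$$p_{12}(j, k) = \frac{6!}{j!\,k!\,(6-j-k)!} \left(\frac{1}{6}\right)^{j+k} \left(\frac{2}{3}\right)^{6-j-k}, \qquad j, k = 0, 1, \ldots, 6;\ j + k \le 6$$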


Thus the multinomial formula appears as the joint probability function of a 
number of random variables, each of which is individually binomial. 
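Now let us find the conditional probability function of $R_1$ given $R_2$; that is,

$$p_1(j \mid k) = P\{R_1 = j \mid R_2 = k\} = \frac{p_{12}(j, k)}{p_2(k)} = \frac{[6!/(j!\,k!\,(6-j-k)!)]\,(1/6)^{j+k}(4/6)^{6-j-k}}{[6!/(k!\,(6-k)!)]\,(1/6)^{k}(5/6)^{6-k}}$$

$$= \frac{(6-k)!}{j!\,(6-j-k)!} \cdot \frac{4^{6-j-k}}{5^{6-k}} = \binom{6-k}{j} \left(\frac{1}{5}\right)^{j} \left(\frac{4}{5}\right)^{6-k-j}$$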
Intuitively, given $R_2 = k$, there are $6 - k$ remaining tosses. The possible
outcomes are 1, 3, 4, 5, or 6 (2 is not permitted), all equally likely. Thus,
given $R_2 = k$, $R_1$ should be binomial with $n = 6 - k$, $p = 1/5$. This is
verified by the formal calculation above.


REMARK. Since the discrete random variables $R_1$ and $R_2$ are independent iff
$p_{12}(j, k) = p_1(j)\,p_2(k)$ for all $j, k$, it follows that independence is
equivalent to $p_1(j \mid k) = p_1(j)$ for all $j, k$ [such that $p_2(k) > 0$].

In the present case $p_1(j \mid k)$ is the binomial probability function
with $n = 6 - k$, $p = 1/5$, and $p_1(j)$ is the binomial probability
function with $n = 6$, $p = 1/6$. Thus $R_1$ and $R_2$ are not independent.
This is clear intuitively; for example, if we know that $R_2 = 6$, the
odds about $R_1$ are certainly affected; in fact, $R_1$ must be 0. ◀
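
The dice example can be verified with exact rational arithmetic. The sketch below is an illustration under the assumptions of Example 1 (six fair dice); the helper names `joint`, `binom_pmf`, and `conditional` are introduced here and do not appear in the text.

```python
from fractions import Fraction
from math import comb, factorial

def joint(j, k, n=6):
    """p12(j, k): probability of j ones and k twos in n tosses (multinomial formula)."""
    if j < 0 or k < 0 or j + k > n:
        return Fraction(0)
    coef = factorial(n) // (factorial(j) * factorial(k) * factorial(n - j - k))
    return coef * Fraction(1, 6) ** (j + k) * Fraction(4, 6) ** (n - j - k)

def binom_pmf(n, p, j):
    """Binomial probability of j successes in n trials with success probability p."""
    return comb(n, j) * p**j * (1 - p) ** (n - j)

def conditional(j, k, n=6):
    """p1(j | k) = p12(j, k) / p2(k), where p2 is binomial with parameters n and 1/6."""
    return joint(j, k, n) / binom_pmf(n, Fraction(1, 6), k)

# Given R2 = 2, the conditional law of R1 should be binomial with n = 4, p = 1/5.
for j in range(5):
    print(j, conditional(j, 2), binom_pmf(4, Fraction(1, 5), j))
```

Since `joint(1, 1)` differs from the product of the two individual binomial probabilities, the same helpers also confirm that $R_1$ and $R_2$ are not independent.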


PROBLEMS 


1. (a) If $|x| < 1/2$, show that

$$\ln(1 + x) = x + \theta x^2$$

where $|\theta| < 1$, $\theta$ depending on $x$.
(b) Show that if $x_n \to \lambda$, then


n 


2. If $R$ has the binomial distribution with $n$ large and $p$ small, the Poisson
approximation with $\lambda = np$ may be used (a rule of thumb that has been given is that the
approximation will be good to several decimal places if $n > 100$ and $p < .01$).
Feller (An Introduction to Probability Theory and Its Applications, vol. 1,
John Wiley and Sons, 1950) gives several examples of such random variables:

(i) The number of color-blind people in a large group (or the number of people
possessing some other rare characteristic).

(ii) The number of misprints on a page.

(iii) The number of radioactive particles (or particles with some other
distinguishing characteristic) passing through a counting device in a given
time interval.

(iv) The number of flying bomb hits on a particular area of London during
World War II ($n$ is the number of bombs in a given period of time, $p$ the
probability that a single bomb will hit the area).

(v) The number of raisins in a cookie.

[Here the assumptions are not entirely clear. Perhaps what is envisioned is that
the dough is bombarded by a raisin gun at some stage in the cookie-making
process. It would seem that this is simply a peaceful version of example (iv).]

In the following exercises, use the Poisson approximation to calculate the
probabilities.

(a) If $p = .001$, how large must $n$ be if $P\{R > 1\} > .99$?

(b) If $np = 2$, find $P\{R > 3\}$.

3. The joint probability function of two discrete random variables $R_1$ and $R_2$ is as
follows:

$p_{12}(1, 1) = .4$
$p_{12}(1, 2) = .3$
$p_{12}(2, 1) = .2$
$p_{12}(2, 2) = .1$

$p_{12}(j, k) = 0$ elsewhere

(a) Determine whether or not $R_1$ and $R_2$ are independent.
(b) Find the probability that $R_1 R_2 < 2$.


4. Let $R_1$ and $R_2$ be independent; assume that $R_1$ has the binomial distribution
with parameters $n$ and $p$, and $R_2$ has the binomial distribution with parameters
$m$ and $p$. Find $P\{R_1 = j \mid R_1 + R_2 = k\}$, and interpret the result intuitively.
[Note: one approach involves establishing the formula $\binom{n+m}{k} = \sum_{j=0}^{k} \binom{n}{j}\binom{m}{k-j}$.]


3.1 INTRODUCTION


We begin here the study of the long-run convergence properties of situations 
involving a very large number of independent repetitions of a random experi- 
ment. As an introductory example, suppose that we observe the length of a 
telephone call made from a specific phone booth at a given time of the day, 
say, the first call after 12 o’clock noon. Suppose that we repeat the experiment 
independently n times, where n is very large, and record the cost of each 
call (which is determined by its length). If we take the arithmetic average of 
the costs, that is, add the total cost of all $n$ calls and then divide by $n$, we
expect physically that the arithmetic average will converge in some sense to a
number that we should interpret as the long-run average cost of a call. We
shall try first to pin down the notion of average more precisely. 
Assume that the cost $R_2$ of a call in terms of its length $R_1$ is as follows.

If $0 < R_1 \le 3$ (minutes)     $R_2 = 10$ (cents)
If $3 < R_1 \le 6$               $R_2 = 20$
If $6 < R_1 \le 9$               $R_2 = 30$

(Assume for simplicity that the telephone is automatically disconnected after
9 minutes.)

Thus $R_2$ takes on three possible values, 10, 20, and 30; say $P\{R_2 = 10\} = .6$,
$P\{R_2 = 20\} = .25$, $P\{R_2 = 30\} = .15$. If we observe $N$ calls, where $N$
is very large, then, roughly, $\{R_2 = 10\}$ will occur $.6N$ times; the total cost of
calls of this type is $10(.6N) = 6N$. $\{R_2 = 20\}$ will occur approximately $.25N$
times, giving rise to a total cost of $20(.25N) = 5N$. $\{R_2 = 30\}$ will occur
approximately $.15N$ times, producing a total cost of $30(.15N) = 4.5N$. The
total cost of all calls is $6N + 5N + 4.5N = 15.5N$, or 15.5 cents per call on
the average.

Observe how we have computed the average. 


$$\frac{10(.6N) + 20(.25N) + 30(.15N)}{N} = 10(.6) + 20(.25) + 30(.15) = \sum_y y\,P\{R_2 = y\}$$


Thus we are taking a weighted average of the possible values of $R_2$, where the
weights are the probabilities of $R_2$ assuming those values. This suggests the
following definition.

Let R be a simple random variable, that is, a discrete random variable 
taking on only finitely many possible values. Define the expectation [also 
called the expected value, average value, mean value, or mean] of R as 


$$E(R) = \sum_x x\,P\{R = x\} \tag{3.1.1}$$
Since R is simple, this is a finite sum and there are no convergence problems. 
In particular, if R is identically constant, say R = c, then E(R) = cP{R = 
c} = c. For short, 


E(c)=c (3.1.2) 


Note that if $R$ takes the values $x_1, \ldots, x_n$, each with probability $1/n$, then
$E(R) = (x_1 + \cdots + x_n)/n$, as we would expect intuitively. In this case each
$x_i$ is given the same weight, namely, $1/n$.
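
Definition (3.1.1) translates directly into a weighted sum. The Python sketch below is illustrative only: the names `expectation` and `simulate_average`, the seed, and the number of simulated calls are assumptions made for the example. It computes the expected cost of a call exactly and compares it with the average over many simulated calls.

```python
from fractions import Fraction
import random

def expectation(pmf):
    """E(R) = sum over the finitely many values x of x * P{R = x}."""
    return sum(x * p for x, p in pmf.items())

# Probability function of the cost of a call, as given in the text.
cost_pmf = {10: Fraction(60, 100), 20: Fraction(25, 100), 30: Fraction(15, 100)}
mean_cost = expectation(cost_pmf)

def simulate_average(pmf, n_calls, seed=0):
    """Arithmetic average of n_calls independent simulated observations."""
    rng = random.Random(seed)
    values = list(pmf)
    weights = [float(pmf[v]) for v in values]
    draws = rng.choices(values, weights=weights, k=n_calls)
    return sum(draws) / n_calls

print(mean_cost)                          # exact value of the weighted sum
print(simulate_average(cost_pmf, 200_000))  # should be close to 15.5
```

With equal weights $1/n$ the same function returns the ordinary arithmetic mean; for a fair die it gives $7/2$.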

We now have the problem of extending the definition to more general 
random variables. If $R$ is an arbitrary discrete random variable, the natural
choice for $E(R)$ is again $\sum_x x\,P\{R = x\}$, provided that the sum makes sense.
(Theorem 1 will make this precise.) 

Similarly, let $R_1$ be discrete and $R_2 = g(R_1)$. Since $R_2$ is also discrete, we
have $E(R_2) = \sum_y y\,P\{R_2 = y\}$. However, if $x_1, x_2, \ldots$ are the values of $R_1$,
then with probability $p_{R_1}(x_i)$ we have $R_1 = x_i$, hence $R_2 = g(x_i)$. Thus if
our definition of expectation is sound, we should have the following alternate
expression for $E(R_2)$:

$$E(R_2) = E[g(R_1)] = \sum_i g(x_i)\,p_{R_1}(x_i) \tag{3.1.3}$$

again a weighted average of possible values of $R_2$, but expressed in terms of
the probability function of $R_1$.

FIGURE 3.1.1 (graph of $f_{R_1}(x)$, with the interval from $x_i$ to $x_i + dx_i$ marked)

If $R_1$ is absolutely continuous, this approach breaks down completely,
since $P\{R_1 = x\} = 0$ for all $x$. However, we may get some idea as to how to
compute $E(R_2) = E[g(R_1)]$ when $R_1$ is absolutely continuous, by making a
discrete approximation. If we split the real line into intervals $(x_i, x_i + dx_i]$,
then, roughly, the probability that $x_i < R_1 \le x_i + dx_i$ is $f_{R_1}(x_i)\,dx_i$ (see
Figure 3.1.1). If $R_1$ falls into this interval, $g(R_1)$ is approximately $g(x_i)$,
at least if $g$ is continuous. Thus an approximation to $E(R_2)$ should be

$$\sum_i g(x_i)\,f_{R_1}(x_i)\,dx_i$$

which suggests that if a general definition of expectation is formulated
properly, and $R_1$ is absolutely continuous,

$$E[g(R_1)] = \int_{-\infty}^{\infty} g(x)\,f_{R_1}(x)\,dx \tag{3.1.4}$$

In the telephone call example above, if $R_1$ has density $f_1$, we obtain

$$E(R_2) = \int_{-\infty}^{\infty} g(x)\,f_1(x)\,dx$$

where
$g(x) = 10$ if $0 < x \le 3$
$= 20$ if $3 < x \le 6$
$= 30$ if $6 < x \le 9$

Thus

$$E(R_2) = 10\int_0^3 f_1(x)\,dx + 20\int_3^6 f_1(x)\,dx + 30\int_6^9 f_1(x)\,dx = 10(.6) + 20(.25) + 30(.15) \quad \text{as before}$$
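
The discrete approximation that motivates (3.1.4) can be carried out numerically. The text does not specify the density of the call length, so the sketch below assumes a hypothetical piecewise-constant density chosen only so that the three intervals have probabilities .6, .25, and .15; the names `cost`, `density`, and `approx_expectation` are likewise assumptions, and midpoints of the subintervals are used as the points $x_i$.

```python
def cost(x):
    """Cost g(x), in cents, of a call lasting x minutes."""
    if 0 < x <= 3:
        return 10
    if 3 < x <= 6:
        return 20
    if 6 < x <= 9:
        return 30
    return 0

def density(x):
    """Hypothetical density: constant on (0,3], (3,6], (6,9] with masses .6, .25, .15."""
    if 0 < x <= 3:
        return 0.6 / 3
    if 3 < x <= 6:
        return 0.25 / 3
    if 6 < x <= 9:
        return 0.15 / 3
    return 0.0

def approx_expectation(g, f, a, b, n):
    """Sum of g(x_i) f(x_i) dx over n equal subintervals of (a, b], x_i = midpoint."""
    dx = (b - a) / n
    return sum(g(a + (i + 0.5) * dx) * f(a + (i + 0.5) * dx) * dx for i in range(n))

print(approx_expectation(cost, density, 0.0, 9.0, 9000))  # close to 15.5
```

Any other density with the same three interval probabilities would give the same answer, since $g$ is constant on each interval.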


If we have an $n$-dimensional situation, for instance $R_0 = g(R_1, \ldots, R_n)$,
the preceding formulas generalize in a natural way. If $R_1, \ldots, R_n$ are discrete,

$$E[g(R_1, \ldots, R_n)] = \sum_{x_1, \ldots, x_n} g(x_1, \ldots, x_n)\,P\{R_1 = x_1, \ldots, R_n = x_n\} \tag{3.1.5}$$

If $(R_1, \ldots, R_n)$ is absolutely continuous with density $f_{12\cdots n}$,

$$E[g(R_1, \ldots, R_n)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_n)\,f_{12\cdots n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n \tag{3.1.6}$$


We shall outline very briefly a general definition of expectation that in- 
cludes all the previous special cases. 
If $R$ is a simple random variable on $(\Omega, \mathscr{F}, P)$, we define

$$E(R) = \sum_x x\,P\{R = x\}$$

just as above. Now let $R$ be a nonnegative random variable. We approximate
$R$ by simple random variables as follows.
Define

$$R_n(\omega) = \frac{k-1}{2^n} \quad \text{if } \frac{k-1}{2^n} \le R(\omega) < \frac{k}{2^n}, \qquad k = 1, 2, \ldots, n2^n$$

and let

$$R_n(\omega) = n \quad \text{if } R(\omega) \ge n$$

(see Figure 3.1.2 for an illustration with $n = 2$).

For any fixed $\omega$, eventually $R(\omega) < n$, so that $0 \le R(\omega) - R_n(\omega) \le 2^{-n}$.
Thus $R_n(\omega) \to R(\omega)$. In fact $R_n(\omega) \le R_{n+1}(\omega)$ for all $n$, $\omega$. For example,
if $3/4 \le R(\omega) < 7/8$, then $R_2(\omega) = R_3(\omega) = 3/4$; if $7/8 \le R(\omega) < 1$, then
$R_2(\omega) = 3/4$, $R_3(\omega) = 7/8$. In general, if $R(\omega)$ lies in the lower half of the
interval $[(k-1)/2^n, k/2^n)$, then $R_n(\omega) = R_{n+1}(\omega)$; if $R(\omega)$ lies in the upper
half, $R_n(\omega) < R_{n+1}(\omega)$.

Thus we have constructed a sequence of nonnegative simple functions $R_n$
converging monotonically up to $R$. We have already defined $E(R_n)$, and
since $R_n \le R_{n+1}$ we have $E(R_n) \le E(R_{n+1})$. We define

$$E(R) = \lim_{n \to \infty} E(R_n) \qquad (\text{this may be } +\infty)$$

It is possible to show that if $\{R_n'\}$ is any other sequence of nonnegative simple
functions converging monotonically up to $R$,

$$\lim_{n \to \infty} E(R_n') = \lim_{n \to \infty} E(R_n)$$

and thus $E(R)$ is well defined.

FIGURE 3.1.2 Approximation of a Nonnegative Random Variable by Simple Random
Variables.
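
The approximating sequence $R_n$ is easy to compute pointwise. The sketch below is an illustration, not part of the text: `simple_approx` evaluates $R_n$ at a point where $R$ takes a given nonnegative value, and `expectation_of_approx` computes $E(R_n)$ under the added assumption that $R$ has density $e^{-x}$, $x \ge 0$ (an example distribution chosen here, for which $E(R) = 1$).

```python
import math

def simple_approx(r, n):
    """Value of R_n at a sample point where R takes the nonnegative value r."""
    if r >= n:
        return float(n)
    # (k-1)/2^n <= r < k/2^n  means  k - 1 = floor(r * 2^n)
    return math.floor(r * 2**n) / 2**n

def expectation_of_approx(n):
    """E(R_n) when R has the (assumed) density e^{-x}, x >= 0."""
    h = 1.0 / 2**n
    total = n * math.exp(-n)  # R_n = n on the event {R >= n}
    for k in range(1, n * 2**n + 1):
        prob = math.exp(-(k - 1) * h) - math.exp(-k * h)  # P{(k-1)h <= R < kh}
        total += (k - 1) * h * prob
    return total

for n in (1, 2, 4, 8, 10):
    print(n, simple_approx(0.9, n), expectation_of_approx(n))
```

The printed values of `simple_approx(0.9, n)` never decrease in $n$, and the expectations increase toward 1, in line with the monotone convergence described above.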


Finally, if $R$ is an arbitrary random variable, let $R^+ = \max(R, 0)$,
$R^- = \max(-R, 0)$; that is,

$R^+(\omega) = R(\omega)$ if $R(\omega) \ge 0$; $R^+(\omega) = 0$ if $R(\omega) < 0$
$R^-(\omega) = -R(\omega)$ if $R(\omega) \le 0$; $R^-(\omega) = 0$ if $R(\omega) > 0$

$R^+$ and $R^-$ are called the positive and negative parts of $R$ (see Figure 3.1.3).
It follows that $R = R^+ - R^-$ (and $|R| = R^+ + R^-$), and we define
$E(R) = E(R^+) - E(R^-)$ if this is not of the form $+\infty - \infty$; if it is, we say that the
expectation does not exist. Note that $E(R)$ is finite if and only if $E(R^+)$ and
$E(R^-)$ are both finite. Since it can be shown that $E(|R|) = E(R^+) + E(R^-)$,
it follows that

$$E(R) \text{ is finite if and only if } E(|R|) \text{ is finite} \tag{3.1.7}$$

The expectation of a nonnegative random variable always exists; it may be $+\infty$.

The following results may be proved. 

Let $R_1, R_2, \ldots, R_n$ be random variables on $(\Omega, \mathscr{F}, P)$, and let
$R_0 = g(R_1, \ldots, R_n)$, where $g$ is a function from $E^n$ to $E^1$. Assume that $g$ has the
property that $g^{-1}(B)$ is a Borel subset of $E^n$ whenever $B$ is a Borel subset of
$E^1$. Then, as we indicated in Section 2.7, $R_0$ is a random variable.


FIGURE 3.1.3 Positive and Negative Parts of a Random Variable.


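Theorem 1. If $R_1, \ldots, R_n$ are discrete, then

$$E[g(R_1, \ldots, R_n)] = \sum_{x_1, \ldots, x_n} g(x_1, \ldots, x_n)\,P\{R_1 = x_1, \ldots, R_n = x_n\}$$

if $g(x_1, \ldots, x_n) \ge 0$ for all $x_1, \ldots, x_n$, or if the series on the right is
absolutely convergent.


Theorem 2. If $(R_1, \ldots, R_n)$ is absolutely continuous with density $f_{12\cdots n}$,
then

$$E[g(R_1, \ldots, R_n)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_n)\,f_{12\cdots n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$

if $g(x_1, \ldots, x_n) \ge 0$ for all $x_1, \ldots, x_n$, or if the integral on the right is
absolutely convergent.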


We shall look at examples that are neither discrete nor absolutely con- 
tinuous in Chapter 4. 


Notice that it is quite possible for the expectation to exist and be infinite,
or not to exist at all. For example, let

$$f_R(x) = \frac{1}{x^2}, \quad x \ge 1; \qquad f_R(x) = 0, \quad x < 1$$

(see Figure 3.1.4a). Then

$$E(R) = \int_{-\infty}^{\infty} x f_R(x)\,dx = \int_1^{\infty} \frac{1}{x}\,dx = \infty$$

FIGURE 3.1.4 (a) $E(R) = \infty$. (b) $E(R)$ Does Not Exist.

As another example, let $f_R(x) = 1/(2x^2)$, $|x| \ge 1$; $f_R(x) = 0$, $|x| < 1$ (Figure
3.1.4b). Then (see Figure 3.1.5)

$$E(R^+) = \int_{-\infty}^{\infty} \max(x, 0)\,f_R(x)\,dx = \int_0^{\infty} x f_R(x)\,dx = \int_1^{\infty} \frac{1}{2x}\,dx = \infty$$

$$E(R^-) = \int_{-\infty}^{\infty} \max(-x, 0)\,f_R(x)\,dx = \int_{-\infty}^{0} -x f_R(x)\,dx = \infty$$

Thus $E(R)$ does not exist.

Finally it can be shown that if two random variables $R_1$ and $R_2$ are
"essentially" equal, that is, if $P\{\omega : R_1(\omega) \ne R_2(\omega)\} = 0$, then $E(R_1) = E(R_2)$
if the expectations exist.
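
One way to see why no value can sensibly be assigned to $E(R)$ for the density $1/(2x^2)$, $|x| \ge 1$, is to truncate the integral $\int x f_R(x)\,dx$ to a finite range and let the range grow in different ways. The sketch below is an added illustration; the function name `truncated_mean` is an assumption, and it uses the closed form of the truncated integral rather than numerical integration.

```python
import math

def truncated_mean(a, b):
    """Integral of x * f_R(x) over [-a, b], for f_R(x) = 1/(2 x^2) on |x| >= 1 (a, b >= 1).

    The positive side contributes (ln b)/2 and the negative side -(ln a)/2.
    """
    return 0.5 * math.log(b) - 0.5 * math.log(a)

for t in (10.0, 1e3, 1e6):
    # Symmetric truncation stays at 0, truncation over [-t, 2t] stays at (ln 2)/2,
    # and truncation over [-t, t^2] grows without bound.
    print(t, truncated_mean(t, t), truncated_mean(t, 2 * t), truncated_mean(t, t * t))
```

The limit depends on how the two tails are balanced, which reflects the fact that $E(R^+)$ and $E(R^-)$ are both infinite.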


REMARK. Theorem 1 fails if the series on the right is conditionally but not
absolutely convergent. For example, let $P\{R_1 = n\} = (1/2)^n$, $n = 1, 2, \ldots$,
and $R_2 = g(R_1)$, where $g(n) = (-1)^{n+1} 2^n/n$. If $R_1(\omega) = n$,
$n$ odd, then $R_2(\omega) = 2^n/n$; hence $R_2^+(\omega) = g(n) = 2^n/n$, $R_2^-(\omega) = 0$.
If $R_1(\omega) = n$, $n$ even, then $R_2(\omega) = -2^n/n$; hence $R_2^+(\omega) = 0$,
$R_2^-(\omega) = -g(n) = 2^n/n$. Therefore, by the nonnegative case of
Theorem 1,

$$E(R_2^+) = \sum_{n \text{ odd}} g(n)\,P\{R_1 = n\} = 1 + \tfrac{1}{3} + \tfrac{1}{5} + \cdots = \infty$$

and

$$E(R_2^-) = \sum_{n \text{ even}} -g(n)\,P\{R_1 = n\} = \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{6} + \cdots = \infty$$

Hence $E(R_2)$ does not exist, although

$$\sum_{n=1}^{\infty} g(n)\,P\{R_1 = n\} = 1 - \tfrac{1}{2} + \tfrac{1}{3} - \tfrac{1}{4} + \cdots$$

is conditionally convergent. From an intuitive standpoint, the
expectation should not change if the series is rearranged; a
conditionally but not absolutely convergent series will not have this
property.

FIGURE 3.1.5
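
The sensitivity of a conditionally convergent series to rearrangement can be observed directly. The following sketch is an added illustration with assumed function names: it sums the terms $1 - 1/2 + 1/3 - \cdots$ in their natural order, and then sums the same terms taking two positive terms for every negative one.

```python
import math

def natural_order_sum(n_terms):
    """Partial sum of 1 - 1/2 + 1/3 - 1/4 + ... in the natural order."""
    return sum((-1) ** (n + 1) / n for n in range(1, n_terms + 1))

def rearranged_sum(n_blocks, pos_per_block=2, neg_per_block=1):
    """Same terms, taken in blocks: some positive terms, then some negative terms."""
    total = 0.0
    odd, even = 1, 2  # next unused odd and even denominators
    for _ in range(n_blocks):
        for _ in range(pos_per_block):
            total += 1 / odd
            odd += 2
        for _ in range(neg_per_block):
            total -= 1 / even
            even += 2
    return total

print(natural_order_sum(100_000), math.log(2))       # near ln 2
print(rearranged_sum(100_000), 1.5 * math.log(2))    # near (3/2) ln 2
```

Both sums use exactly the same terms, yet they approach different limits, so no rearrangement-independent value is available to serve as an expectation.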
3.2 TERMINOLOGY AND EXAMPLES


If $R$ is a random variable on a given probability space, the $k$th moment of
$R$ ($k > 0$, not necessarily an integer) is defined by

$$\alpha_k = E(R^k) \quad \text{if the expectation exists}$$

Thus

$$\alpha_k = \sum_x x^k p_R(x) \quad \text{if } R \text{ is discrete}$$

$$= \int_{-\infty}^{\infty} x^k f_R(x)\,dx \quad \text{if } R \text{ is absolutely continuous}$$

$\alpha_1$ is simply $E(R)$, the expectation of $R$, often written as $m$ and called the
mean of $R$. If $R$ has density $f_R$, $m$ may be regarded as the abscissa of the
centroid of the region in the plane between the $x$-axis and the graph of $f_R$
(see Figure 3.2.1). To see this, notice that the total moment of the region
about the line $x = m$ is

$$\int_{-\infty}^{\infty} (x - m) f_R(x)\,dx = m - m = 0$$

FIGURE 3.2.1 Geometric Interpretation of $E(R)$. The "Strip" Between $x$ and $x + dx$
Contributes $(x - m) f_R(x)\,dx$ to the Moment about $x = m$.


The expectation of R is a measure of central tendency in the sense that the 
arithmetic average of n independent observations of R converges (in a sense 
yet to be made precise) to E(R). 

The $k$th central moment of $R$ ($k > 0$) is defined by

$$\beta_k = E[(R - m)^k] \quad \text{if } m \text{ is finite and the expectation exists}$$

$$= \sum_x (x - m)^k p_R(x) \quad \text{if } R \text{ is discrete}$$

$$= \int_{-\infty}^{\infty} (x - m)^k f_R(x)\,dx \quad \text{if } R \text{ is absolutely continuous}$$

Notice that $\beta_1 = E(R - m) = m - m = 0$.

$\beta_2 = E[(R - m)^2]$ is called the variance of $R$, written $\sigma^2$, $\sigma^2(R)$, or
Var $R$. $\sigma$ (the positive square root of $\sigma^2$) is called the standard deviation of
$R$. Note that if $R$ has finite mean, then, since $(R - m)^2 \ge 0$, Var $R$ always
exists; it may be infinite.

If $R$ has density $f_R$, the variance of $R$ may be regarded as the moment of
inertia of the region in the plane between the $x$-axis and the graph of $f_R$,
about the axis $x = m$.

The variance may be interpreted as a measure of dispersion. A large
variance corresponds to a high probability that $R$ will fall far from its mean, while
a small variance indicates that $R$ is likely to be close to its mean (see Figure
3.2.2). We shall make a quantitative statement to this effect (Chebyshev's
inequality) in Section 3.7.


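FIGURE 3.2.2 (graphs of $f_R(x)$ for small $\sigma^2$ and for larger $\sigma^2$)


▶ Example 1. Consider the normal density function

$$f_R(x) = \frac{1}{\sqrt{2\pi}\,b}\, e^{-(x-a)^2/2b^2}, \qquad b > 0,\ a \text{ real}$$

Since $f_R$ is symmetrical about $x = a$, the centroid of the area under $f_R$ has
abscissa $a$, so that $E(R) = a$. We compute the variance of $R$.

$$\sigma^2 = \int_{-\infty}^{\infty} \frac{(x-a)^2}{\sqrt{2\pi}\,b}\, e^{-(x-a)^2/2b^2}\,dx$$

Let $y = (x - a)/(\sqrt{2}\,b)$. We obtain

$$\sigma^2 = \int_{-\infty}^{\infty} \frac{2b^2 y^2}{\sqrt{2\pi}\,b}\, e^{-y^2} \sqrt{2}\,b\,dy = \frac{2b^2}{\sqrt{\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2}\,dy$$

Now, by (2.8.2),

$$\sqrt{\pi} = \int_{-\infty}^{\infty} e^{-y^2}\,dy$$

Integrate by parts to obtain

$$\sqrt{\pi} = \Big[\, y e^{-y^2} \Big]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} (-2y^2)\,e^{-y^2}\,dy$$

It follows that

$$\int_{-\infty}^{\infty} y^2 e^{-y^2}\,dy = \tfrac{1}{2}\sqrt{\pi}$$

Hence $\sigma^2 = b^2$. Thus we may write

$$f_R(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-m)^2/2\sigma^2} \tag{3.2.1}$$

In this case the mean and variance determine the density completely.
If $R$ has the normal density with mean $m$ and variance $\sigma^2$, we sometimes
write "$R$ is normal $(m, \sigma^2)$" for short. ◀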
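Before looking at the next example, it will be convenient to introduce the
gamma function, defined by

$$\Gamma(r) = \int_0^{\infty} x^{r-1} e^{-x}\,dx, \qquad r > 0 \tag{3.2.2}$$

Integrating by parts, we have

$$\Gamma(r + 1) = \int_0^{\infty} x^{r} e^{-x}\,dx = \Big[ -x^{r} e^{-x} \Big]_0^{\infty} + r \int_0^{\infty} x^{r-1} e^{-x}\,dx$$

Thus

$$\Gamma(r + 1) = r\,\Gamma(r) \tag{3.2.3}$$

Since

$$\Gamma(1) = \int_0^{\infty} e^{-x}\,dx = 1$$

we have

$$\Gamma(2) = 1 \cdot \Gamma(1) = 1, \quad \Gamma(3) = 2\,\Gamma(2) = 2 \cdot 1, \quad \Gamma(4) = 3\,\Gamma(3) = 3 \cdot 2 \cdot 1 = 3!$$

and

$$\Gamma(n + 1) = n!, \qquad n = 0, 1, \ldots \tag{3.2.4}$$

We also need $\Gamma(1/2)$.

$$\Gamma(\tfrac{1}{2}) = \int_0^{\infty} x^{-1/2} e^{-x}\,dx$$

Let $x = y^2$ to obtain

$$\Gamma(\tfrac{1}{2}) = \int_0^{\infty} \frac{1}{y}\, e^{-y^2}\, 2y\,dy = 2 \int_0^{\infty} e^{-y^2}\,dy = \int_{-\infty}^{\infty} e^{-y^2}\,dy$$

By (2.8.2), we have

$$\Gamma(\tfrac{1}{2}) = \sqrt{\pi} \tag{3.2.5}$$


▶ Example 2. Let $R_1$ be absolutely continuous with density $f_1(x) = e^{-x}$,
$x \ge 0$; $f_1(x) = 0$, $x < 0$. Let $R_2 = R_1^2$. We may compute $E(R_2)$ in two ways.

1. $E(R_2) = E(R_1^2) = \int_{-\infty}^{\infty} x^2 f_1(x)\,dx = \int_0^{\infty} x^2 e^{-x}\,dx = \Gamma(3) = 2$ by (3.2.4)

2. We may find the density of $R_2$ by the technique of Section 2.4 (see
Figure 3.2.3). We have

$$f_2(y) = f_1(\sqrt{y})\,\frac{d}{dy}\sqrt{y} = \frac{e^{-\sqrt{y}}}{2\sqrt{y}}, \quad y > 0; \qquad f_2(y) = 0, \quad y < 0$$

FIGURE 3.2.3 Computation of Density of $R_2$.

Then

$$E(R_2) = \int_{-\infty}^{\infty} y f_2(y)\,dy = \int_0^{\infty} \frac{y\,e^{-\sqrt{y}}}{2\sqrt{y}}\,dy = (\text{with } y = x^2) \int_0^{\infty} x^2 e^{-x}\,dx = 2$$

as before.

Notice that both methods must give the same answer by Theorem 2 of
Section 3.1. For $R_2(\omega) = (R_1(\omega))^2$; applying the theorem with $g(R_1) = R_1^2$,
we obtain

$$E(R_1^2) = \int_{-\infty}^{\infty} x^2 f_1(x)\,dx$$

Applying the theorem with $g(R_2) = R_2$, we have

$$E(R_2) = \int_{-\infty}^{\infty} y f_2(y)\,dy$$

Generally the first method is easier, since the computation of the density of
$R_2$ is avoided. ◀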
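
Both methods of Example 2 can be checked by numerical integration. The sketch below is illustrative: the midpoint-rule helper `integrate`, the truncation of the infinite ranges at 50 and 2500, and the number of subintervals are assumptions made for the example.

```python
import math

def integrate(h, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    dx = (b - a) / n
    return sum(h(a + (i + 0.5) * dx) for i in range(n)) * dx

# Method 1: E(R1^2) as the integral of x^2 e^{-x}; the tail beyond 50 is negligible.
method_1 = integrate(lambda x: x * x * math.exp(-x), 0.0, 50.0)

# Method 2: E(R2) as the integral of y f2(y), with f2(y) = e^{-sqrt(y)} / (2 sqrt(y)).
method_2 = integrate(
    lambda y: y * math.exp(-math.sqrt(y)) / (2.0 * math.sqrt(y)), 0.0, 2500.0
)

print(method_1, method_2, math.gamma(3))  # all three should be close to 2
```

The call to `math.gamma(3)` evaluates $\Gamma(3)$ from the standard library for comparison with (3.2.4).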


▶ Example 3. Let $R_1$ and $R_2$ be independent, each with density $f(x) = e^{-x}$,
$x \ge 0$; $f(x) = 0$, $x < 0$. Let $R_3 = \max(R_1, R_2)$. We compute $E(R_3)$.

$$E(R_3) = E[g(R_1, R_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\,f_{12}(x, y)\,dx\,dy = \int_0^{\infty}\int_0^{\infty} \max(x, y)\,e^{-x} e^{-y}\,dx\,dy$$

Now $\max(x, y) = x$ if $x \ge y$; $\max(x, y) = y$ if $x \le y$ (see Figure 3.2.4).
Thus

$$E(R_3) = \iint_A x\,e^{-x} e^{-y}\,dx\,dy + \iint_B y\,e^{-x} e^{-y}\,dx\,dy$$

$$= \int_{x=0}^{\infty} x e^{-x} \int_{y=0}^{x} e^{-y}\,dy\,dx + \int_{y=0}^{\infty} y e^{-y} \int_{x=0}^{y} e^{-x}\,dx\,dy$$

FIGURE 3.2.4 (on $A$, $\max(x, y) = x$; on $B$, $\max(x, y) = y$)

The two integrals are equal, since one may be obtained from the other by
interchanging $x$ and $y$. Thus

$$E(R_3) = 2\int_0^{\infty} x e^{-x} \int_0^{x} e^{-y}\,dy\,dx = 2\int_0^{\infty} x e^{-x}(1 - e^{-x})\,dx$$

$$= 2\int_0^{\infty} x e^{-x}\,dx - 2\int_0^{\infty} x e^{-2x}\,dx = 2\,\Gamma(2) - \tfrac{1}{2}\,\Gamma(2) = \tfrac{3}{2}$$

The moments and central moments of a random variable R, especially the 
mean and variance, give some information about the behavior of R. In 
many situations it may be difficult to compute the distribution function of R 
explicitly, but the calculation of some of the moments may be easier. We 
shall examine some problems of this type in Section 3.5. 

Another parameter that gives some information about a random variable
$R$ is the median of $R$, defined when $F_R$ is continuous as a number $\mu$ (not
necessarily unique) such that $F_R(\mu) = 1/2$ (see Figure 3.2.5a and b).

In general the median of a random variable $R$ is a number $\mu$ such that

$$F_R(\mu) = P\{R \le \mu\} \ge \tfrac{1}{2}$$

$$F_R(\mu^-) = P\{R < \mu\} \le \tfrac{1}{2}$$

(see Figure 3.2.5c).
Loosely speaking, $\mu$ is the halfway point of the distribution function of $R$.

FIGURE 3.2.5 (a) $\mu$ is the Unique Median. (b) Any Number Between $a$ and $b$ is a Median.
(c) $\mu$ is the Unique Median.




PROBLEMS 


1. Let $R$ be normally distributed with mean 0 and variance 1. Show that

$$E(R^n) = 0, \quad n \text{ odd}$$

$$= (n - 1)(n - 3) \cdots (5)(3)(1), \quad n \text{ even}$$


2. Let $R_1$ have the exponential density $f_1(x) = e^{-x}$, $x \ge 0$; $f_1(x) = 0$, $x < 0$. Let
$R_2 = g(R_1)$ be the largest integer $\le R_1$ (if $0 \le R_1 < 1$, $R_2 = 0$; if $1 \le R_1 < 2$,
$R_2 = 1$, and so on).

(a) Find $E(R_2)$ by computing $\int_{-\infty}^{\infty} g(x) f_1(x)\,dx$.
(b) Find $E(R_2)$ by evaluating the probability function of $R_2$ and then computing
$\sum_y y\,p_{R_2}(y)$.


3. Let $R_1$ and $R_2$ be independent random variables, each with the exponential
density $f(x) = e^{-x}$, $x \ge 0$; $f(x) = 0$, $x < 0$. Find the expectation of
(a) $R_1 R_2$
(b) $R_1 - R_2$
(c) $|R_1 - R_2|$


4. Let $R_1$ and $R_2$ be independent, each uniformly distributed between $-1$ and $+1$.
Find $E[\max(R_1, R_2)]$.


5. Suppose that the density function for the length $R$ of a telephone call is

$$f(x) = x e^{-x}, \quad x \ge 0; \qquad f(x) = 0, \quad x < 0$$

The cost of a call is

$$C(R) = 2, \quad 0 < R \le 3; \qquad C(R) = 2 + 6(R - 3), \quad R > 3$$


Find the average cost of a call. 


6. Two machines are put into service at $t = 0$, processing the same data. Let $R_i$
($i = 1, 2$) be the time (in hours) at which machine $i$ breaks down. Assume that
$R_1$ and $R_2$ are independent random variables, each having the exponential density
function $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$; $f(x) = 0$, $x < 0$. Suppose that we start counting
down time if and only if both machines are out of service. No repairs are allowed
during the working day (which is $T$ hours long), but any machine that has
failed during the day is assumed to be completely repaired by the time the next
day begins. For example, if $T = 8$ and the machines fail at $t = 2$ and $t = 6$,
the down time is 2 hours.

(a) Find the probability that at least one machine will fail during a working day. 
(b) Find the average down time per day. (Leave the answer in the form of an 
integral.) 


7. Show that if $R$ has the binomial distribution with parameters $n$ and $p$, that is, $R$
is the number of successes in $n$ Bernoulli trials with probability of success $p$ on
a given trial, then $E(R) = np$, as one should expect intuitively. HINT: in
$E(R) = \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k}$, factor out $np$ and use the binomial theorem.


REMARK. In Section 3.5 we shall calculate the mean and variance of R in an 
indirect but much more efficient way. 


8. If $R$ has the Poisson distribution with parameter $\lambda$, show that

$$E[R(R - 1)(R - 2) \cdots (R - r + 1)] = \lambda^r$$

Conclude that $E(R) = \text{Var } R = \lambda$.


3.3 PROPERTIES OF EXPECTATION 


In this section we list several basic properties of the expectation of a random 
variable. A precise justification of these properties would require a detailed 
analysis of the general definition of E(R) that we gave in Section 3.1; what 
we actually did there was to outline the construction of the abstract Lebesgue 
integral. Instead we shall give plausibility arguments or proofs in special 
cases. 

1. Let $R_1, \ldots, R_n$ be random variables on a given probability space. Then

$$E(R_1 + \cdots + R_n) = E(R_1) + \cdots + E(R_n)$$

CAUTION. Recall that $E(R)$ can be $\pm\infty$, or not exist at all. The complete
statement of property 1 is: If $E(R_i)$ exists for all $i = 1, 2, \ldots, n$,
and $+\infty$ and $-\infty$ do not both appear in the sum $E(R_1) + \cdots + E(R_n)$
($+\infty$ alone or $-\infty$ alone is allowed), then $E(R_1 + \cdots + R_n)$
exists and equals $E(R_1) + \cdots + E(R_n)$.

For example, suppose that $(R_1, R_2)$ has density $f_{12}$ and $R' = g(R_1, R_2)$,
$R'' = h(R_1, R_2)$. Then

$$E(R' + R'') = E[g(R_1, R_2) + h(R_1, R_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [g(x, y) + h(x, y)]\,f_{12}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\,f_{12}(x, y)\,dx\,dy + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x, y)\,f_{12}(x, y)\,dx\,dy$$

$$= E(R') + E(R'')$$
2. If $R$ is a random variable whose expectation exists, and $a$ is any real
number, then $E(aR)$ exists and

$$E(aR) = aE(R)\,\dagger$$

For example, if $R_1$ has density $f_1$ and $R_2 = aR_1$, then

$$E(R_2) = \int_{-\infty}^{\infty} a x f_1(x)\,dx = aE(R_1)$$

Basically, properties 1 and 2 say that the expectation is linear.

3. If $R_1 \le R_2$, then $E(R_1) \le E(R_2)$, assuming that both expectations exist.

For example, if $R$ has density $f$, and $R_1 = g(R)$, $R_2 = h(R)$, and $g \le h$,
we have

$$E(R_1) = \int_{-\infty}^{\infty} g(x) f(x)\,dx \le \int_{-\infty}^{\infty} h(x) f(x)\,dx = E(R_2)$$

4. If $R \ge 0$ and $E(R) = 0$, then $R$ is essentially 0; that is, $P\{R = 0\} = 1$.

This we can actually prove, from the previous properties. Define $R_n = 0$ if
$0 \le R < 1/n$; $R_n = 1/n$ if $R \ge 1/n$. Then $0 \le R_n \le R$, so that, by property
3, $E(R_n) = 0$. But $R_n$ has only two possible values, 0 and $1/n$, and so

$$E(R_n) = \sum_y y\,p_{R_n}(y) = 0 \cdot P\{R_n = 0\} + \frac{1}{n}\,P\Big\{R_n = \frac{1}{n}\Big\}$$

Thus

$$P\Big\{R_n = \frac{1}{n}\Big\} = P\Big\{R \ge \frac{1}{n}\Big\} = 0 \quad \text{for all } n$$

But

$$P\{R > 0\} = P\Big[\bigcup_{n=1}^{\infty} \Big\{R \ge \frac{1}{n}\Big\}\Big] \le \sum_{n=1}^{\infty} P\Big\{R \ge \frac{1}{n}\Big\} = 0$$

Hence

$$P\{R = 0\} = 1$$

Notice that if $R$ is discrete, the argument is much faster: if $\sum_{x \ge 0} x\,p_R(x) = 0$,
then $x\,p_R(x) = 0$ for all $x \ge 0$; hence $p_R(x) = 0$ for $x > 0$, and therefore
$p_R(0) = 1$.

COROLLARY. If Var $R = 0$, then $R$ is essentially constant.
PROOF. If $m = E(R)$, then $E[(R - m)^2] = 0$, hence $P\{R = m\} = 1$.


† Since $E(R)$ is allowed to be infinite, expressions of the form $0 \cdot \infty$ will occur. The most
convenient way to handle this is simply to define $0 \cdot \infty = 0$; no inconsistency will result.
5. Let $R_1, \ldots, R_n$ be independent random variables.
(a) If all the $R_i$ are nonnegative, then

$$E(R_1 R_2 \cdots R_n) = E(R_1)E(R_2) \cdots E(R_n)$$

(b) If $E(R_i)$ is finite for all $i$ (whether or not the $R_i \ge 0$), then $E(R_1 R_2 \cdots R_n)$
is finite and

$$E(R_1 R_2 \cdots R_n) = E(R_1)E(R_2) \cdots E(R_n)$$

We can prove this when all the $R_i$ are discrete, if we accept certain facts
about infinite series. For

$$E(R_1 R_2 \cdots R_n) = \sum_{x_1, \ldots, x_n} x_1 x_2 \cdots x_n\,p_{12\cdots n}(x_1, \ldots, x_n) = \sum_{x_1, \ldots, x_n} x_1 \cdots x_n\,p_1(x_1) \cdots p_n(x_n)$$

Under hypothesis (a) we may restrict the $x_i$'s to be $\ge 0$. Under hypothesis
(b) the above series is absolutely convergent. Since a nonnegative or
absolutely convergent series can be summed in any order, we have

$$E(R_1 R_2 \cdots R_n) = \sum_{x_1} x_1 p_1(x_1) \cdots \sum_{x_n} x_n p_n(x_n) = E(R_1)E(R_2) \cdots E(R_n)$$

If $(R_1, \ldots, R_n)$ is absolutely continuous, the argument is similar, with sums
replaced by integrals.

$$E(R_1 R_2 \cdots R_n) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1 \cdots x_n\,f_{12\cdots n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$

$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1 \cdots x_n\,f_1(x_1) \cdots f_n(x_n)\,dx_1 \cdots dx_n$$

$$= \int_{-\infty}^{\infty} x_1 f_1(x_1)\,dx_1 \cdots \int_{-\infty}^{\infty} x_n f_n(x_n)\,dx_n$$


6. Let $R$ be a random variable with finite mean $m$ and variance $\sigma^2$ (possibly
infinite). If $a$ and $b$ are real numbers, then

$$\text{Var}\,(aR + b) = a^2\sigma^2$$

PROOF. Since $E(aR + b) = am + b$ by properties 1 and 2 [and (3.1.2)],
we have

$$\text{Var}\,(aR + b) = E[(aR + b - (am + b))^2] = E[a^2(R - m)^2] = a^2 E[(R - m)^2] \quad \text{by property 2}$$

$$= a^2\sigma^2$$
7. Let $R_1, \ldots, R_n$ be independent random variables, each with finite
mean. Then

$$\text{Var}\,(R_1 + \cdots + R_n) = \text{Var } R_1 + \cdots + \text{Var } R_n$$

PROOF. Let $m_i = E(R_i)$. Then

$$\text{Var}\,(R_1 + \cdots + R_n) = E\Big[\Big(\sum_{i=1}^{n} R_i - \sum_{i=1}^{n} m_i\Big)^2\Big] = E\Big[\Big(\sum_{i=1}^{n} (R_i - m_i)\Big)^2\Big]$$

If this is expanded, the "cross terms" are 0, since, if $i \ne j$,

$$E[(R_i - m_i)(R_j - m_j)] = E(R_i R_j - m_i R_j - m_j R_i + m_i m_j)$$

$$= E(R_i)E(R_j) - m_i E(R_j) - m_j E(R_i) + m_i m_j \quad \text{by properties 5, 1, and 2}$$

$$= 0 \quad \text{since } E(R_i) = m_i,\ E(R_j) = m_j$$

Thus

$$\text{Var}\,(R_1 + \cdots + R_n) = \sum_{i=1}^{n} E[(R_i - m_i)^2] = \sum_{i=1}^{n} \text{Var } R_i$$

COROLLARY. If $R_1, \ldots, R_n$ are independent, each with finite mean, and
$a_1, \ldots, a_n, b$ are real numbers, then

$$\text{Var}\,(a_1 R_1 + \cdots + a_n R_n + b) = a_1^2\,\text{Var } R_1 + \cdots + a_n^2\,\text{Var } R_n$$

PROOF. This follows from properties 6 and 7. (Notice that $a_1 R_1, \ldots, a_n R_n$
are still independent; see Problem 1.)


8. The central moments $\beta_1, \ldots, \beta_n$ ($n \ge 2$) can be obtained from the
moments $\alpha_1, \ldots, \alpha_n$, provided that $\alpha_1, \ldots, \alpha_{n-1}$ are finite and $\alpha_n$ exists.

To see this, expand $(R - m)^n$ by the binomial theorem.

$$(R - m)^n = \sum_{k=0}^{n} \binom{n}{k} R^k (-m)^{n-k}$$

Thus

$$\beta_n = E[(R - m)^n] = \sum_{k=0}^{n} \binom{n}{k} (-m)^{n-k} \alpha_k$$

Notice that since $\alpha_1, \ldots, \alpha_{n-1}$ are finite, no terms of the form $+\infty - \infty$
can appear in the summation, and thus we may take the expectation term by
term, by property 1.

This result is applied most often when $n = 2$. If $R$ has finite mean [$E(R^2)$
always exists since $R^2 \ge 0$], then $(R - m)^2 = R^2 - 2mR + m^2$; hence

$$\text{Var } R = E(R^2) - 2mE(R) + m^2$$

That is,

$$\sigma^2 = E(R^2) - [E(R)]^2 \tag{3.3.1}$$

which is the "mean of the square" minus the "square of the mean."
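
Formula (3.3.1) can be compared with the definition of the variance on a small example. The sketch below is illustrative; the helper names and the choice of a fair die as the random variable are assumptions made here.

```python
from fractions import Fraction

def moment(pmf, k):
    """k-th moment: sum of x^k * P{R = x}."""
    return sum(x**k * p for x, p in pmf.items())

def variance_from_definition(pmf):
    """Second central moment E[(R - m)^2]."""
    m = moment(pmf, 1)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

def variance_from_moments(pmf):
    """Mean of the square minus square of the mean, as in (3.3.1)."""
    return moment(pmf, 2) - moment(pmf, 1) ** 2

die = {x: Fraction(1, 6) for x in range(1, 7)}  # one toss of a fair die (assumed example)
print(variance_from_definition(die), variance_from_moments(die))
```

Exact fractions are used so that the two results can be compared without rounding error.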


9. If $E(R^k)$ is finite and $0 < j < k$, then $E(R^j)$ is also finite.

PROOF.

$$|R(\omega)|^j \le |R(\omega)|^k \quad \text{if } |R(\omega)| \ge 1; \qquad |R(\omega)|^j \le 1 \quad \text{if } |R(\omega)| < 1$$

Thus

$$|R(\omega)|^j \le 1 + |R(\omega)|^k \quad \text{for all } \omega$$

Hence

$$E(|R|^j) \le 1 + E(|R|^k) < \infty$$

and the result follows. Notice that the expectation of a random variable is
finite if and only if the expectation of its absolute value is finite; see (3.1.7).

Thus in property 8, if $\alpha_{n-1}$ is finite, automatically $\alpha_1, \ldots, \alpha_{n-2}$ are finite
as well.


REMARK. Properties 5 and 7 fail without the hypothesis of independence.
For example, let $R_1 = R_2 = R$, where $R$ has finite mean. Then
$E(R_1 R_2) \ne E(R_1)E(R_2)$ since $E(R^2) - [E(R)]^2 = \text{Var } R$, which is
$> 0$ unless $R$ is essentially constant, by the corollary to property 4.
Also, Var $(R_1 + R_2)$ = Var $(2R)$ = 4 Var $R$, which is not the same
as Var $R_1$ + Var $R_2$ = 2 Var $R$ unless $R$ is essentially constant.
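
The contrast between the independent case and the case $R_1 = R_2 = R$ can be made concrete by exact enumeration. In the sketch below, which is an added illustration, $R$ is taken to be one toss of a fair die; that choice and the helper names are assumptions.

```python
from fractions import Fraction
from itertools import product

die = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(joint, g):
    """E[g(R1, R2)] for a joint probability function {(x, y): probability}."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def variance(joint, g):
    """Var g(R1, R2), computed as E[(g - E g)^2]."""
    m = expectation(joint, g)
    return expectation(joint, lambda x, y: (g(x, y) - m) ** 2)

def add(x, y):
    return x + y

def multiply(x, y):
    return x * y

# Two independent tosses, versus the same toss used twice (R1 = R2 = R).
independent = {(x, y): die[x] * die[y] for x, y in product(die, repeat=2)}
identical = {(x, x): die[x] for x in die}

print(variance(independent, add), variance(identical, add))
print(expectation(independent, multiply), expectation(identical, multiply))
```

In the independent case the variance of the sum is twice the variance of a single toss and the expectation of the product is the square of the mean; in the identical case the variance of the sum is four times that of a single toss.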


PROBLEMS 


1. If $R_1, \ldots, R_n$ are independent random variables, show that $a_1 R_1 + b_1, \ldots,$
$a_n R_n + b_n$ are independent for all possible choices of the constants $a_i$ and $b_i$.

2. If $R$ is normally distributed with mean $m$ and variance $\sigma^2$, evaluate the central
moments of $R$ (see Problem 1, Section 3.2).

3. Let $\theta$ be uniformly distributed between 0 and $2\pi$. Define $R_1 = \cos\theta$, $R_2 = \sin\theta$.
Show that $E(R_1 R_2) = E(R_1)E(R_2)$, and also Var $(R_1 + R_2)$ = Var $R_1$ + Var $R_2$,
but $R_1$ and $R_2$ are not independent. Thus, in properties 5 and 7, the converse
assertion is false.

4. If $E(R)$ exists, show that $|E(R)| \le E(|R|)$.

5. Let $R$ be a random variable with finite mean. Indicate how and under what
conditions the moments of $R$ can be obtained from the central moments. In
particular show that $E(R^2) < \infty$ if and only if Var $R < \infty$. More generally,
$\alpha_n$ is finite if and only if $\beta_n$ is finite.


3.4 CORRELATION 


If $R_1$ and $R_2$ are random variables on a given probability space, we may define
joint moments associated with $R_1$ and $R_2$

$$\alpha_{jk} = E(R_1^j R_2^k), \qquad j, k > 0$$

and joint central moments

$$\beta_{jk} = E[(R_1 - m_1)^j (R_2 - m_2)^k], \qquad m_1 = E(R_1),\ m_2 = E(R_2)$$

We shall study $\beta_{11} = E[(R_1 - m_1)(R_2 - m_2)] = E(R_1 R_2) - E(R_1)E(R_2)$,
which is called the covariance of $R_1$ and $R_2$, written Cov $(R_1, R_2)$.

In this section we assume that $E(R_1)$ and $E(R_2)$ are finite, and $E(R_1 R_2)$
exists; then the covariance of $R_1$ and $R_2$ is well defined.


Theorem 1. If $R_1$ and $R_2$ are independent, then Cov $(R_1, R_2) = 0$, but not
conversely.


PROOF. By property 5 of Section 3.3, independence of $R_1$ and $R_2$ implies
that $E(R_1 R_2) = E(R_1)E(R_2)$; hence Cov $(R_1, R_2) = 0$. An example in which
Cov $(R_1, R_2) = 0$ but $R_1$ and $R_2$ are not independent is given in Problem 3
of Section 3.3.
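
A small discrete example, different from the one in Problem 3 of Section 3.3, shows the same phenomenon. In the sketch below (an added illustration, with names chosen here) $R_1$ is assumed uniform on $\{-1, 0, 1\}$ and $R_2 = R_1^2$, so that $R_2$ is completely determined by $R_1$.

```python
from fractions import Fraction

# Hypothetical example: R1 uniform on {-1, 0, 1} and R2 = R1^2.
joint = {(-1, 1): Fraction(1, 3), (0, 0): Fraction(1, 3), (1, 1): Fraction(1, 3)}

def expectation(g):
    """E[g(R1, R2)] under the joint probability function above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

cov = expectation(lambda x, y: x * y) - expectation(lambda x, y: x) * expectation(lambda x, y: y)

# Independence would require P{R1 = 0, R2 = 0} = P{R1 = 0} P{R2 = 0}.
p_r1_is_0 = sum(p for (x, _), p in joint.items() if x == 0)
p_r2_is_0 = sum(p for (_, y), p in joint.items() if y == 0)
p_both_0 = joint[(0, 0)]

print(cov, p_both_0, p_r1_is_0 * p_r2_is_0)
```

The covariance vanishes, while the joint probability $1/3$ differs from the product $1/9$ of the individual probabilities.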


We shall try to find out what the knowledge of the covariance of $R_1$ and
$R_2$ tells us about the random variables themselves. We first establish a very
useful inequality.


Theorem 2 (Schwarz Inequality). Assume that $E(R_1^2)$ and $E(R_2^2)$ are finite
($R_1$ and $R_2$ then automatically have finite mean, by property 9 of Section 3.3,
and finite variance, by property 8). Then $E(R_1 R_2)$ is finite, and

$$|E(R_1 R_2)|^2 \le E(R_1^2)\,E(R_2^2)$$

PROOF. If $R_1$ is essentially 0, the inequality is immediate, so assume $R_1$
not essentially 0; then $E(R_1^2) > 0$. For any real number $x$ let

$$h(x) = E[(x|R_1| + |R_2|)^2] = E(R_1^2)\,x^2 + 2E(|R_1 R_2|)\,x + E(R_2^2)$$

Since $h(x)$ is the expectation of a nonnegative random variable, it must be
$\ge 0$ for all $x$. The quadratic equation $h(x) = 0$ has either no real roots or, at
worst, one real repeated root (see Figure 3.4.1). Thus the discriminant must
be $\le 0$; hence

$$(E(|R_1 R_2|))^2 \le E(R_1^2)\,E(R_2^2) < \infty$$

Since $E(|R_1 R_2|)$ is finite, so is $E(R_1 R_2)$, by (3.1.7). Furthermore,
$|E(R_1 R_2)| \le E(|R_1 R_2|)$ (Problem 4, Section 3.3), and the result follows.

FIGURE 3.4.1 Proof of the Schwarz Inequality.


Now assume that $E(R_1^2)$ and $E(R_2^2)$ are finite and, in addition, that the
variances $\sigma_1^2$ and $\sigma_2^2$ of $R_1$ and $R_2$ are $> 0$. Define the correlation coefficient
of $R_1$ and $R_2$ as

$$\rho(R_1, R_2) = \frac{\text{Cov}\,(R_1, R_2)}{\sigma_1 \sigma_2}$$

By Theorem 1, if $R_1$ and $R_2$ are independent, they are uncorrelated; that is,
$\rho(R_1, R_2) = 0$, but not conversely.


Theorem 3. $-1 \le \rho(R_1, R_2) \le 1$.


PROOF. Apply the Schwarz inequality to $R_1 - E(R_1)$ and $R_2 - E(R_2)$.

$$|E[(R_1 - ER_1)(R_2 - ER_2)]|^2 \le E[(R_1 - ER_1)^2]\,E[(R_2 - ER_2)^2]$$

Thus $|\text{Cov}\,(R_1, R_2)|^2 \le \sigma_1^2 \sigma_2^2$, and the result follows.


We shall show that $\rho$ is a measure of linear dependence between $R_1$ and $R_2$
[more precisely, between $R_1 - E(R_1)$ and $R_2 - E(R_2)$], in the following sense.

Let us try to estimate $R_2 - ER_2$ by a linear combination $c(R_1 - ER_1) + d$,
that is, find the $c$ and $d$ that minimize

$$E\{[(R_2 - ER_2) - (c(R_1 - ER_1) + d)]^2\} = \sigma_2^2 - 2c\,\operatorname{Cov}(R_1, R_2) + c^2\sigma_1^2 + d^2$$

$$= \sigma_2^2 - 2c\rho\sigma_1\sigma_2 + c^2\sigma_1^2 + d^2$$

Clearly we can do no better than to take $d = 0$. Now the minimum of
$Ax^2 + 2Bx + D$ ($A > 0$) occurs for $x = -B/A$; hence
$\sigma_1^2c^2 - 2\rho\sigma_1\sigma_2c + \sigma_2^2$ is minimized when

$$c = \frac{\rho\sigma_1\sigma_2}{\sigma_1^2} = \frac{\rho\sigma_2}{\sigma_1}$$

Thus the minimum expectation is $\sigma_2^2 - 2\rho^2\sigma_2^2 + \rho^2\sigma_2^2 = \sigma_2^2(1 - \rho^2)$.
For a given $\sigma_2^2$, the closer $|\rho|$ is to 1, the better $R_2$ is approximated (in the
mean square sense) by a linear combination $aR_1 + b$. In particular, if
$|\rho| = 1$, then

$$E\left[\left(R_2 - ER_2 - \frac{\rho\sigma_2}{\sigma_1}(R_1 - ER_1)\right)^2\right] = 0$$

so that

$$R_2 - ER_2 = \frac{\rho\sigma_2}{\sigma_1}(R_1 - ER_1)$$

with probability 1.

Thus, if $|\rho| = 1$, then $R_1 - E(R_1)$ and $R_2 - E(R_2)$ are linearly dependent.
(The random variables $R_1, \ldots, R_n$ are said to be linearly dependent
iff there are real numbers $a_1, \ldots, a_n$, not all 0, such that
$P\{a_1R_1 + \cdots + a_nR_n = 0\} = 1$.) Conversely, if $R_1 - E(R_1)$ and $R_2 - E(R_2)$ are linearly
dependent, that is, if $a(R_1 - ER_1) + b(R_2 - ER_2) = 0$ with probability 1
for some constants $a$ and $b$, not both 0, then $|\rho| = 1$ (Problem 1).
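
The minimization above can be checked numerically. The sketch below is only an illustration: it applies the formulas to the empirical distribution of simulated pairs, and the linear-plus-noise model, the seed, and the helper name `linear_estimate_stats` are assumptions made for the example, not part of the text. Because the empirical distribution is itself a probability distribution, the identity "minimum mean square error $= \sigma_2^2(1 - \rho^2)$" should hold up to rounding.

```python
import random

def linear_estimate_stats(xs, ys):
    """Return (rho, c, var_y, mse) for the empirical distribution of (xs, ys).

    c is the slope rho * sigma_2 / sigma_1 from the text, and mse is the mean
    square error of estimating y - mean(y) by c * (x - mean(x)).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    rho = cov / (var_x * var_y) ** 0.5
    c = rho * var_y ** 0.5 / var_x ** 0.5
    mse = sum(((y - my) - c * (x - mx)) ** 2 for x, y in zip(xs, ys)) / n
    return rho, c, var_y, mse

rng = random.Random(0)                               # fixed seed, illustrative only
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]     # hypothetical model for R_2

rho, c, var_y, mse = linear_estimate_stats(xs, ys)
print(rho, c, mse, var_y * (1.0 - rho ** 2))
```

The last two printed numbers should agree to many decimal places; the closer $|\rho|$ is to 1, the smaller both become relative to `var_y`.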


PROBLEMS 


1. If $R_1 - E(R_1)$ and $R_2 - E(R_2)$ are linearly dependent, show that
$|\rho(R_1, R_2)| = 1$.


2. If $aR_1 + bR_2 = c$ for some constants $a$, $b$, $c$, where $a$ and $b$ are not both 0,
show that $R_1 - E(R_1)$ and $R_2 - E(R_2)$ are linearly dependent. Thus $|\rho(R_1, R_2)| = 1$
if and only if there is a line $L$ in the plane such that $(R_1(\omega), R_2(\omega))$ lies on $L$
for "almost" all $\omega$, that is, for all $\omega$ outside a set of probability 0.


3. Show that equality occurs in the Schwarz inequality, $|E(R_1R_2)|^2 = E(R_1^2)E(R_2^2)$,
if and only if $R_1$ and $R_2$ are linearly dependent.


4. Prove the following results.
(a) Schwarz inequality for sums: For any real numbers $a_1, \ldots, a_n$, $b_1, \ldots, b_n$,
$\left(\sum_{i=1}^n a_ib_i\right)^2 \le \sum_{i=1}^n a_i^2 \sum_{i=1}^n b_i^2$.
(b) Schwarz inequality for integrals: If $\int_a^b g^2(x)\,dx$ and $\int_a^b h^2(x)\,dx$ are finite, so is
$\int_a^b g(x)h(x)\,dx$, and furthermore $\left(\int_a^b g(x)h(x)\,dx\right)^2 \le \int_a^b g^2(x)\,dx \int_a^b h^2(x)\,dx$.
HINT: show that both (a) and (b) are special cases of Theorem 2.

5. Show that if $R_1, \ldots, R_n$ are arbitrary random variables with $E(R_i^2)$ finite for
all $i$, then

$$\operatorname{Var}(R_1 + \cdots + R_n) = \sum_{i=1}^n \operatorname{Var} R_i + 2\sum_{\substack{i,j=1 \\ i<j}}^n \operatorname{Cov}(R_i, R_j)$$

3.5 THE METHOD OF INDICATORS 


In this section we introduce a technique that in certain cases allows the 
expectation of a random variable to be computed quickly, without any 
knowledge of the distribution function. This is especially useful in situations 
when the distribution function is difficult to calculate. 

The indicator of an event $A$ is a random variable $I_A$ defined as follows.

$$I_A(\omega) = 1 \text{ if } \omega \in A; \qquad I_A(\omega) = 0 \text{ if } \omega \notin A$$

Thus $I_A = 1$ if $A$ occurs and 0 if $A$ does not occur. (Sometimes $I_A$ is called the
"characteristic function" of $A$, but we do not use this terminology since we
reserve the term "characteristic function" for something quite different.)
The expectation of $I_A$ is given by

$$E(I_A) = 0 \cdot P\{I_A = 0\} + 1 \cdot P\{I_A = 1\} = P\{I_A = 1\} = P(A)$$

The "method of indicators" simply involves expressing, if possible, a given
random variable $R$ as a sum of indicators, say, $R = I_{A_1} + \cdots + I_{A_n}$. Then

$$E(R) = \sum_{j=1}^n E(I_{A_j}) = \sum_{j=1}^n P(A_j)$$

Hopefully, it will be easier to compute the $P(A_j)$ than to evaluate $E(R)$
directly.


▶ Example 1. Let $R$ be the number of successes in $n$ Bernoulli trials, with
probability of success $p$ on a given trial; then $R$ has the binomial distribution
with parameters $n$ and $p$; that is,

$$P\{R = k\} = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n$$

We have found by a direct evaluation that $E(R) = np$ (Problem 7, Section
3.2), but the method of indicators does the job more smoothly. Let $A_i$ be
the event that there is a success on trial $i$, $i = 1, 2, \ldots, n$. Then
$R = I_{A_1} + \cdots + I_{A_n}$ (note that $I_{A_i}$ may be regarded as the number of successes
on trial $i$). Thus

$$E(R) = \sum_{i=1}^n E(I_{A_i}) = \sum_{i=1}^n P(A_i) = np$$

Now, since $A_1, \ldots, A_n$ are independent, the indicators $I_{A_1}, \ldots, I_{A_n}$
are independent (Problem 1), and so there is a bonus, namely (by property
7, Section 3.3),

$$\operatorname{Var} R = \sum_{i=1}^n \operatorname{Var} I_{A_i}$$

But $I_{A_i}^2 = I_{A_i}$; hence

$$E(I_{A_i}^2) = E(I_{A_i}) = P(A_i) = p$$

Therefore

$$\operatorname{Var} I_{A_i} = E(I_{A_i}^2) - [E(I_{A_i})]^2 \quad \text{by (3.3.1)}$$

$$= p - p^2 = p(1 - p)$$

Thus

$$\operatorname{Var} R = np(1 - p)$$ ◄
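
As a cross-check on Example 1, the two indicator results can be compared with the moments obtained by summing directly over the binomial probability function. The sketch below is an illustration only; the values $n = 10$, $p = 1/3$ and the helper name `binomial_moments` are assumptions chosen for the example, and exact rational arithmetic is used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from math import comb

def binomial_moments(n, p):
    """Mean and variance of the binomial distribution, summed directly from its pmf."""
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    second_moment = sum(k * k * q for k, q in enumerate(pmf))
    return mean, second_moment - mean ** 2

n, p = 10, Fraction(1, 3)            # illustrative values, exact rationals
mean, variance = binomial_moments(n, p)
print(mean == n * p)                 # compares with E(R) = np
print(variance == n * p * (1 - p))   # compares with Var R = np(1 - p)
```

The direct route needs the whole distribution of $R$; the indicator route needs only $P(A_i) = p$ and independence.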


▶ Example 2. A single unbiased die is tossed independently $n$ times. Let
$R_1$ be the number of 1's obtained, and $R_2$ the number of 2's. Find $E(R_1R_2)$.
If $A_i$ is the event that the $i$th toss results in a 1, and $B_i$ the event that the
$i$th toss results in a 2, then

$$R_1 = I_{A_1} + \cdots + I_{A_n}$$

$$R_2 = I_{B_1} + \cdots + I_{B_n}$$

Hence

$$E(R_1R_2) = \sum_{i,j=1}^n E(I_{A_i}I_{B_j})$$

Now if $i \ne j$, $I_{A_i}$ and $I_{B_j}$ are independent (see Problem 1); hence

$$E(I_{A_i}I_{B_j}) = E(I_{A_i})E(I_{B_j}) = P(A_i)P(B_j) = \frac{1}{36}$$

If $i = j$, $A_i$ and $B_i$ are disjoint, since the $i$th toss cannot simultaneously
result in a 1 and a 2. Thus $I_{A_i}I_{B_i} = I_{A_i \cap B_i} = 0$ (see Problem 2). Thus

$$E(R_1R_2) = \frac{n(n - 1)}{36}$$

since there are $n(n - 1)$ ordered pairs $(i, j)$ of integers $\in \{1, 2, \ldots, n\}$ such
that $i \ne j$.

Note that the $I_{A_i}I_{B_j}$, $i, j = 1, \ldots, n$, are not independent [for instance,
if $I_{A_i}(\omega)I_{B_j}(\omega) = 1$, then $I_{A_j}(\omega)I_{B_i}(\omega)$ must be 0], so that we cannot compute
the variance of $R_1R_2$ in the same way as in Example 1. ◄
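
For small $n$ the result of Example 2 can be confirmed by brute force, enumerating all $6^n$ equally likely outcomes. This is a sketch for illustration only; the function name `expected_product` and the range of $n$ tried are choices made here, not part of the text.

```python
from fractions import Fraction
from itertools import product

def expected_product(n):
    """E(R_1 R_2) for n tosses of a fair die, by enumerating all 6**n outcomes."""
    total = 0
    for tosses in product(range(1, 7), repeat=n):
        total += tosses.count(1) * tosses.count(2)   # R_1(w) * R_2(w)
    return Fraction(total, 6 ** n)

for n in (1, 2, 3, 4):
    # each line shows the enumerated value next to n(n - 1)/36
    print(n, expected_product(n), Fraction(n * (n - 1), 36))
```

The enumeration grows as $6^n$, which is exactly the kind of work the method of indicators avoids.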
PROBLEMS


1. If the events $A_1, \ldots, A_n$ are independent, show that the indicators
$I_{A_1}, \ldots, I_{A_n}$ are independent random variables, and conversely.

2. Establish the following properties of indicators:
(a) $I_\Omega = 1$, $I_\emptyset = 0$
(b) $I_{A \cap B} = I_AI_B$, $I_{A \cup B} = I_A + I_B - I_{A \cap B}$
(c) $I_{\bigcup_i A_i} = \sum_i I_{A_i}$ if the $A_i$ are disjoint
(d) If $A_1, A_2, \ldots$ is an expanding sequence of events ($A_n \subset A_{n+1}$ for all $n$) and
$\bigcup_{n=1}^\infty A_n = A$, or if $A_1, A_2, \ldots$ is a contracting sequence ($A_{n+1} \subset A_n$ for
all $n$) and $\bigcap_{n=1}^\infty A_n = A$, then $I_{A_n} \to I_A$; that is, $\lim_{n \to \infty} I_{A_n}(\omega) = I_A(\omega)$
for all $\omega$.


3. In Example 2, find the joint probability function of $R_1$ and $R_2$. Notice how
unwieldy is the direct expression for $E(R_1R_2)$.

$$E(R_1R_2) = \sum_{j,k=0}^n jk\,P\{R_1 = j, R_2 = k\}$$


4. In a sequence of $n$ Bernoulli trials, let $R_0$ be the number of times a success is
followed immediately by a failure. For example, if $n = 7$ and $\omega = (SSFFSFS)$,
then $R_0(\omega) = 2$, as indicated. Find $E(R_0)$.


5. Find $\operatorname{Var} R_0$ in Problem 4.


6. 100 balls are tossed independently and at random into 50 boxes. Let $R$ be the
number of empty boxes. Find $E(R)$.


3.6 SOME PROPERTIES OF THE NORMAL DISTRIBUTION 


Let $R_1$ be normally distributed with mean $m$ and variance $\sigma^2$.

$$f_1(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-m)^2/2\sigma^2}$$

If $R_2 = aR_1 + b$, $a \ne 0$, we shall show that $R_2$ is also normally distributed
[necessarily $E(R_2) = am + b$, $\operatorname{Var} R_2 = a^2\sigma^2$ by properties 1, 2, and 6 of
Section 3.3].

We may use the technique of Section 2.4 to find the density of $R_2$. $R_2 = y$
corresponds to $R_1 = h(y) = (y - b)/a$. Thus

$$f_2(y) = f_1(h(y))\,|h'(y)| = \frac{1}{|a|}\, f_1\!\left(\frac{y - b}{a}\right) = \frac{1}{\sqrt{2\pi}\,|a|\,\sigma} \exp\left[-\frac{(y - (am + b))^2}{2a^2\sigma^2}\right]$$

so that $R_2$ has the normal density with mean $am + b$ and variance $a^2\sigma^2$. We
may use this result in the calculation of probabilities of events involving a
normally distributed random variable. If $R$ has the normal density with
$E(R) = m$, $\operatorname{Var} R = \sigma^2$, then

$$P\{a \le R \le b\} = \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-m)^2/2\sigma^2}\,dx$$

One must resort to tables to evaluate this. The point we wish to bring out is
that, regardless of $m$ and $\sigma^2$, only one table is needed, namely, that of the
normal distribution function when $m = 0$, $\sigma^2 = 1$; that is,

$$F^*(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt$$

For if $R$ is normally distributed with $E(R) = m$, $\operatorname{Var} R = \sigma^2$, then
$R^* = (R - m)/\sigma$ is normally distributed with $E(R^*) = 0$ and $\operatorname{Var} R^* = 1$. Thus

$$P\{a \le R \le b\} = P\left\{\frac{a - m}{\sigma} \le R^* \le \frac{b - m}{\sigma}\right\} = F^*\!\left(\frac{b - m}{\sigma}\right) - F^*\!\left(\frac{a - m}{\sigma}\right)$$

A brief table of values of $F^*$ is given at the end of the book.
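
In place of a printed table, $F^*$ can be evaluated with the error function, using the standard identity $F^*(x) = \tfrac{1}{2}[1 + \operatorname{erf}(x/\sqrt{2})]$. The sketch below is an illustration of the standardization step; the function names and the parameters $m = 10$, $\sigma = 2$ are assumptions made for the example.

```python
from math import erf, sqrt

def std_normal_cdf(x):
    """F*(x), the distribution function of a normal variable with m = 0, sigma^2 = 1."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_interval_prob(a, b, m, sigma):
    """P{a <= R <= b} for normal R with mean m and standard deviation sigma,
    computed by standardizing to R* = (R - m)/sigma."""
    return std_normal_cdf((b - m) / sigma) - std_normal_cdf((a - m) / sigma)

m, sigma = 10.0, 2.0                  # hypothetical parameters
prob = normal_interval_prob(m - 1.96 * sigma, m + 1.96 * sigma, m, sigma)
print(prob)
```

Changing `m` and `sigma` leaves `prob` essentially unchanged, since only the standardized endpoints $\pm 1.96$ enter the computation.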


REMARK. If a random variable has a density function $f$ that is symmetrical
about 0 [i.e., an even function: $f(-x) = f(x)$], then the distribution
function has the property that $F(-x) = 1 - F(x)$. For (see Figure
3.6.1)

$$F(-x) = P\{R \le -x\} = \int_{-\infty}^{-x} f(t)\,dt = \int_x^\infty f(t)\,dt = P\{R \ge x\} = 1 - F(x)$$

FIGURE 3.6.1 Symmetrical Density.

In particular, the distribution function $F^*$ has this property, and thus
once the values of $F^*(x)$ for positive $x$ are known, the values of $F^*(x)$
for negative $x$ are determined.


PROBLEMS 


1. Let $R$ be normally distributed with $m = 1$, $\sigma^2 = 9$.
(a) Find $P\{-.5 \le R \le 4\}$.
(b) If $P\{R > c\} = .9$, find $c$.

2. If $R$ is normally distributed and $k$ is a positive real number, show that
$P\{|R - m| \ge k\sigma\}$ does not depend on $m$ or $\sigma$; thus one can speak unambiguously
of the "probability that a normally distributed random variable lies at least $k$
standard deviations from its mean." Show that when $k = 1.96$, the probability
is .05.


3.7 CHEBYSHEV’S INEQUALITY AND THE WEAK LAW 
OF LARGE NUMBERS 


In this section we are going to prove a result that corresponds to the physical 
statement that the arithmetic average of a very large number of independent 
observations of a random variable R is very likely to be very close to E(R). 
We first establish a quantitative result about the variance as a measure of 
dispersion. 


Theorem 1.


(a) Let $R$ be a nonnegative random variable, and $b$ a positive real number.
Then

$$P\{R \ge b\} \le \frac{E(R)}{b}$$

PROOF. We first consider the absolutely continuous case. We have

$$E(R) = \int_{-\infty}^\infty x f_R(x)\,dx = \int_0^\infty x f_R(x)\,dx$$

since $R \ge 0$, so that $f_R(x) = 0$, $x < 0$. Now if we drop the integral from 0
to $b$, we get something smaller.

$$E(R) \ge \int_b^\infty x f_R(x)\,dx$$

Since $x \ge b$ on the range of integration,

$$\int_b^\infty x f_R(x)\,dx \ge \int_b^\infty b f_R(x)\,dx = bP\{R \ge b\}$$

This is the desired result.

The general proof is based on the same idea. Let $A_b = \{R \ge b\}$; then
$R \ge RI_{A_b}$. For if $\omega \notin A_b$, this says simply that $R(\omega) \ge 0$; if $\omega \in A_b$, it says
that $R(\omega) \ge R(\omega)$. Thus $E(R) \ge E(RI_{A_b})$. But $RI_{A_b} \ge bI_{A_b}$, since $\omega \in A_b$
implies that $R(\omega) \ge b$. Thus

$$E(R) \ge E(RI_{A_b}) \ge E(bI_{A_b}) = bE(I_{A_b}) = bP(A_b)$$

Consequently $P(A_b) \le E(R)/b$, as desired.


(b) Let $R$ be an arbitrary random variable, $c$ any real number, and $\varepsilon$ and
$m$ positive real numbers. Then

$$P\{|R - c| \ge \varepsilon\} \le \frac{E[|R - c|^m]}{\varepsilon^m}$$

PROOF.

$$P\{|R - c| \ge \varepsilon\} = P\{|R - c|^m \ge \varepsilon^m\} \le \frac{E[|R - c|^m]}{\varepsilon^m} \quad \text{by (a)}$$


(c) If $R$ has finite mean $m$ and finite variance $\sigma^2 > 0$, and $k$ is a positive
real number, then

$$P\{|R - m| \ge k\sigma\} \le \frac{1}{k^2}$$

PROOF. This follows from (b) with $c = m$, $\varepsilon = k\sigma$, and exponent $m = 2$.


All three parts of Theorem 1 go under the name of Chebyshev's inequality.
Part (c) says that the probability that a random variable will fall $k$ or more
standard deviations from its mean is $\le 1/k^2$. Notice that nothing at all is
said about the distribution function of $R$; Chebyshev's inequality is therefore
quite a general statement. When applied to a particular case, however, it
may be quite weak. For example, let $R$ be normally distributed with mean $m$
and variance $\sigma^2$. Then (Problem 2, Section 3.6) $P\{|R - m| \ge 1.96\sigma\} = .05$.
In this case Chebyshev's inequality says only that

$$P\{|R - m| \ge 1.96\sigma\} \le \frac{1}{(1.96)^2}$$

which is a much weaker statement. The strength of Chebyshev's inequality
lies in its universality.
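
To see how loose the bound can be, one may tabulate the exact normal tail probability next to $1/k^2$ for a few values of $k$. The sketch below is an illustration only; it uses the standard identity $P\{|R - m| \ge k\sigma\} = 1 - \operatorname{erf}(k/\sqrt{2})$ for normal $R$, and the function names and the list of $k$ values are choices made here.

```python
from math import erf, sqrt

def normal_tail(k):
    """P{|R - m| >= k*sigma} when R is normally distributed."""
    return 1.0 - erf(k / sqrt(2.0))

def chebyshev_bound(k):
    """Upper bound 1/k^2 from part (c) of Theorem 1."""
    return 1.0 / k ** 2

for k in (1.0, 1.96, 3.0, 4.0):
    # exact normal probability versus the distribution-free bound
    print(k, normal_tail(k), chebyshev_bound(k))
```

For $k = 1.96$ the bound is roughly .26 against an actual probability of about .05, and the gap widens as $k$ grows.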

We are now ready for the main result.

Theorem 2 (Weak Law of Large Numbers). For each $n = 1, 2, \ldots$,
suppose that $R_1, R_2, \ldots, R_n$ are independent random variables on a given
probability space, each having finite mean and variance. Assume that the
variances are uniformly bounded; that is, assume that there is some finite
positive number $M$ such that $\sigma_i^2 \le M$ for all $i$. Let $S_n = \sum_{i=1}^n R_i$. Then,
for any $\varepsilon > 0$,

$$P\left\{\left|\frac{S_n - E(S_n)}{n}\right| \ge \varepsilon\right\} \to 0 \quad \text{as } n \to \infty$$

Before proving the theorem, we consider two cases of interest.


SPECIAL CASES
1. Suppose that $E(R_i) = m$ for all $i$, and $\operatorname{Var} R_i = \sigma^2$ for all $i$. Then

$$E(S_n) = \sum_{i=1}^n E(R_i) = nm$$

$$\frac{S_n - E(S_n)}{n} = \frac{R_1 + \cdots + R_n}{n} - m$$

Therefore, for any arbitrary $\varepsilon > 0$, there is for large $n$ a high probability
that the arithmetic average and the expectation $m$ will differ by $< \varepsilon$.

This case covers the situation when $R_1, R_2, \ldots, R_n$ are independent
observations of a given random variable $R$. All this means is that $R_1, \ldots, R_n$
are independent, and the $R_i$ all have the same distribution function, namely,
$F_R$. In particular, $E(R_i) = E(R)$, so that for large $n$ there is a high probability
that $(R_1 + \cdots + R_n)/n$ and $E(R)$ will differ by $< \varepsilon$.

2. Consider a sequence of Bernoulli trials, and let $R_i$ be the number of
successes on trial $i$; that is, $R_i = I_{A_i}$, where $A_i = \{\text{success on trial } i\}$. Then
$(R_1 + \cdots + R_n)/n$ is the relative frequency of successes in $n$ trials. Now
$E(R_i) = P(A_i) = p$, the probability of success on a given trial, so that for
large $n$ there is a high probability that the relative frequency will differ from
$p$ by $< \varepsilon$.


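PROOF OF THEOREM 2. By the second form of Chebyshev's inequality
[part (b) of Theorem 1], with $R = (S_n - E(S_n))/n$, $c = 0$, and $m = 2$, we
have

$$P\left\{\left|\frac{S_n - E(S_n)}{n}\right| \ge \varepsilon\right\} \le \frac{1}{\varepsilon^2}\, E\left[\left(\frac{S_n - E(S_n)}{n}\right)^2\right] = \frac{1}{n^2\varepsilon^2} \operatorname{Var} S_n$$

But since $R_1, \ldots, R_n$ are independent,

$$\operatorname{Var} S_n = \sum_{i=1}^n \operatorname{Var} R_i \quad \text{by property 7 of Section 3.3}$$

$$\le nM \quad \text{by hypothesis}$$

Thus

$$P\left\{\left|\frac{S_n - E(S_n)}{n}\right| \ge \varepsilon\right\} \le \frac{nM}{n^2\varepsilon^2} = \frac{M}{n\varepsilon^2} \to 0$$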
REMARK. If a coin with probability $p$ of heads is tossed indefinitely, the
successive tosses being independent, we expect that as a practical
matter the relative frequency of heads will converge, in the ordinary
sense of convergence of a sequence of real numbers, to $p$. This is a
somewhat stronger statement than the weak law of large numbers,
which says that for large $n$ the relative frequency of heads in $n$ trials
is very likely to be very close to $p$. The first statement, when properly
formulated, becomes the strong law of large numbers, which we shall
examine in detail later.
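
The weak law for Bernoulli trials can be watched in a small simulation. The sketch below is an illustration under assumed values ($p = .5$, $\varepsilon = .1$, 500 simulated sequences per value of $n$, a fixed seed); it estimates the probability that the relative frequency misses $p$ by at least $\varepsilon$ and prints it next to the bound $M/n\varepsilon^2$ from the proof, with $M = p(1 - p)$.

```python
import random

def deviation_frequency(n, p, eps, runs, rng):
    """Estimate P{|S_n/n - p| >= eps} from `runs` simulated sequences of n trials."""
    hits = 0
    for _ in range(runs):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        if abs(successes / n - p) >= eps - 1e-12:   # small slack for rounding
            hits += 1
    return hits / runs

def chebyshev_bound(n, p, eps):
    """Bound M/(n eps^2) with M = p(1 - p), the variance of one indicator."""
    return p * (1.0 - p) / (n * eps ** 2)

rng = random.Random(1)                # fixed seed, illustrative only
p, eps = 0.5, 0.1                     # hypothetical values
results = [(n, deviation_frequency(n, p, eps, 500, rng), chebyshev_bound(n, p, eps))
           for n in (50, 200, 800)]
for n, freq, bound in results:
    print(n, freq, bound)
```

Both columns shrink as $n$ grows, with the estimated probability typically far below the bound, as the remarks on the weakness of Chebyshev's inequality suggest.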




PROBLEMS 


1. Let $R$ have the exponential density $f(x) = e^{-x}$, $x \ge 0$; $f(x) = 0$, $x < 0$. Evaluate
$P\{|R - m| \ge k\sigma\}$ and compare with the Chebyshev bound.


2. Suppose that we have a sequence of random variables $R_n$ such that
$P\{R_n = e^n\} = 1/n$, $P\{R_n = 0\} = 1 - 1/n$, $n = 1, 2, \ldots$.
(a) State and prove a theorem that expresses the fact that for large $n$, $R_n$ is
very likely to be 0.
(b) Show that $E(R_n^k) \to \infty$ as $n \to \infty$ for any $k > 0$.


3. Suppose that $R_n$ is the amount you win on trial $n$ in a game of chance. Assume
that the $R_i$ are independent random variables, each with finite mean $m$ and finite
variance $\sigma^2$. Make the realistic assumption that $m < 0$. Show that
$P\{(R_1 + \cdots + R_n)/n < m/2\} \to 1$ as $n \to \infty$. What is the moral of this result?


Conditional Probability 
and Expectation 


4.1 INTRODUCTION 


We have thus far defined the conditional probability P(B|A) only when 
P(A) > 0. However, there are many situations when it is natural to talk 
about a conditional probability given an event of probability 0. For example, 
suppose that a real number $R$ is selected at random, with density $f$. If $R$
takes the value $x$, a coin with probability of heads $g(x)$ is tossed
($0 \le g(x) \le 1$). It is natural to assert that the conditional probability of obtaining a head,
given $R = x$, is $g(x)$. But since $R$ is absolutely continuous, the event $\{R = x\}$
has probability 0, and thus conditional probabilities given $R = x$ are not as
yet defined.

If we ignore this problem for the moment, we can find the over-all probability
of obtaining a head by the following intuitive argument. The probability
that $R$ will fall into the interval $(x, x + dx]$ is roughly $f(x)\,dx$; given
that $R$ falls into this interval, the probability of a head is roughly $g(x)$. Thus
we should expect, from the theorem of total probability, that the probability
of a head will be $\sum_x g(x)f(x)\,dx$, which approximates $\int_{-\infty}^\infty g(x)f(x)\,dx$. Thus
the probability in question is a weighted average of conditional probabilities,
the weights being assigned in accordance with the density $f$.

FIGURE 4.1.1


Let us examine what is happening here. We have two random variables
$R_1$ and $R_2$ [$R_1 = R$, $R_2 =$ (say) the number of heads obtained]. We are
specifying the density of $R_1$, and for each $x$ and each Borel set $B$ we are
specifying a quantity $P_x(B)$ that is to be interpreted intuitively as the conditional
probability that $R_2 \in B$ given that $R_1 = x$. (We shall often write
$P\{R_2 \in B \mid R_1 = x\}$ for $P_x(B)$.)

We would like to conclude that the probabilities of all events
involving $R_1$ and $R_2$ are now determined. Suppose that $C$ is a
two-dimensional Borel set. What is a reasonable figure for $P\{(R_1, R_2) \in C\}$?
Intuitively, the probability that $R_1$ falls into $(x, x + dx]$ is $f_1(x)\,dx$. Given
that this happens, that is, (roughly) given $R_1 = x$, the only way $(R_1, R_2)$
can lie in $C$ is if $R_2$ belongs to the "section" $C_x = \{y: (x, y) \in C\}$ (see Figure
4.1.1a). This happens with probability $P_x(C_x)$. Thus we expect that the total
probability that $(R_1, R_2)$ will belong to $C$ is

$$\int_{-\infty}^\infty P_x(C_x) f_1(x)\,dx$$

In particular, if $C = A \times B = \{(x, y): x \in A, y \in B\}$ (see Figure 4.1.1b),

$$C_x = \emptyset \text{ if } x \notin A; \qquad C_x = B \text{ if } x \in A$$

Thus

$$P\{(R_1, R_2) \in C\} = P\{R_1 \in A, R_2 \in B\} = \int_A P_x(B) f_1(x)\,dx$$


The above reasoning may be formalized as follows. Let $\Omega = E^2$, $\mathscr{F} =$
Borel subsets, $R_1(x, y) = x$, $R_2(x, y) = y$. Let $f_1$ be a density function on $E^1$,
that is, a nonnegative function such that $\int_{-\infty}^\infty f_1(x)\,dx = 1$. Suppose that for
each real $x$ we are given a probability measure $P_x$ on the Borel subsets of $E^1$.
Assume also that $P_x(B)$ is a piecewise continuous function of $x$ for each fixed
$B$.

Then it turns out that there is a unique probability measure $P$ on $\mathscr{F}$ such
that for all Borel subsets $A$, $B$ of $E^1$

$$P(A \times B) = \int_A P_x(B) f_1(x)\,dx \tag{4.1.1}$$

Thus the requirement (4.1.1), which may be regarded as a continuous version
of the theorem of total probability, determines $P$ uniquely. In fact, if $C \in \mathscr{F}$,
$P(C)$ is given explicitly by

$$P(C) = \int_{-\infty}^\infty P_x(C_x) f_1(x)\,dx \tag{4.1.2}$$

Notice that if $R_1(x, y) = x$, $R_2(x, y) = y$, then

$$P(A \times B) = P\{R_1 \in A, R_2 \in B\}$$

and

$$P(C) = P\{(R_1, R_2) \in C\}$$

Furthermore, the distribution function of $R_1$ is given by

$$F_1(x_0) = P\{R_1 \le x_0\} = P\{R_1 \in A, R_2 \in B\} \quad \text{where } A = (-\infty, x_0],\ B = (-\infty, \infty)$$

$$= \int_A P_x(B) f_1(x)\,dx = \int_{-\infty}^{x_0} f_1(x)\,dx$$

Thus $f_1$ is in fact the density of $R_1$. Notice also that

$$P\{R_2 \in B\} = P\{R_1 \in A, R_2 \in B\}$$

where $A = (-\infty, \infty)$; hence

$$P\{R_2 \in B\} = \int_{-\infty}^\infty P_x(B) f_1(x)\,dx \tag{4.1.3}$$

To summarize: If we start with a density for $R_1$ and a set of probabilities
$P_x(B)$ that we interpret as $P\{R_2 \in B \mid R_1 = x\}$, the probabilities of events of
the form $\{(R_1, R_2) \in C\}$ are determined in a natural way, if you believe that
there should be a continuous version of the theorem of total probability;
$P\{(R_1, R_2) \in C\}$ is given explicitly by (4.1.2), which reduces to (4.1.1) in the
special case when $C = A \times B$.

We have not yet answered the question of how to define $P\{R_2 \in B \mid R_1 = x\}$
for arbitrarily specified random variables $R_1$ and $R_2$; we attack this problem
later in the chapter. Instead we have approached the problem in a somewhat
oblique way. However, there are many situations in which one specifies the
density of $R_1$, and then the conditional probability of events involving $R_2$,
given $R_1 = x$. We now know how to formulate such problems precisely.
Consider again the problem at the beginning of the section. If $R_1$ has density
$f_1$ and a coin with probability of heads $g(x)$ is tossed whenever $R_1 = x$ (and
a head corresponds to $R_2 = 1$, a tail to $R_2 = 0$), then the probability of
obtaining a head is

$$P\{R_2 = 1\} = \int_{-\infty}^\infty P\{R_2 = 1 \mid R_1 = x\} f_1(x)\,dx \quad \text{by (4.1.3)}$$

$$= \int_{-\infty}^\infty g(x) f_1(x)\,dx$$

in agreement with the previous intuitive argument.
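
A concrete instance may make (4.1.3) more tangible. In the sketch below, both the density $f_1(x) = e^{-x}$ ($x \ge 0$) and the coin probability $g(x) = e^{-x}$ are assumptions chosen only because the integral $\int_0^\infty e^{-2x}\,dx = 1/2$ is easy to verify by hand; the function names, step counts, and seed are likewise illustrative. The integral is approximated by the midpoint rule and compared with a direct simulation of the two-stage experiment.

```python
import math
import random

def f1(x):
    """Assumed density of R_1: exponential with mean 1."""
    return math.exp(-x) if x >= 0 else 0.0

def g(x):
    """Assumed probability of heads when R_1 = x."""
    return math.exp(-x)

def prob_head_by_integration(upper=40.0, steps=40000):
    """Midpoint-rule approximation of the integral of g(x) f1(x) over [0, upper]."""
    h = upper / steps
    return sum(g((i + 0.5) * h) * f1((i + 0.5) * h) for i in range(steps)) * h

def prob_head_by_simulation(runs, rng):
    """Two-stage experiment: draw R_1, then toss a coin with P(heads) = g(R_1)."""
    heads = 0
    for _ in range(runs):
        x = rng.expovariate(1.0)
        if rng.random() < g(x):
            heads += 1
    return heads / runs

numeric = prob_head_by_integration()
simulated = prob_head_by_simulation(20000, random.Random(2))
print(numeric, simulated)
```

Both numbers should be close to $1/2$, the first to many decimal places and the second to within ordinary sampling error.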


4.2 EXAMPLES 

We apply the general results of this section to some typical special cases. 

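▶ Example 1. A point is chosen with uniform density between 0 and 1. If
the number $R_1$ selected is $x$, then a coin with probability $x$ of heads is tossed
independently $n$ times. If $R_2$ is the resulting number of heads, find
$p_2(k) = P\{R_2 = k\}$, $k = 0, 1, \ldots, n$.

Here we have $f_1(x) = 1$, $0 \le x \le 1$; $f_1(x) = 0$ elsewhere. Also

$$P_x\{k\} = P\{R_2 = k \mid R_1 = x\} = \binom{n}{k} x^k (1 - x)^{n-k}$$

By (4.1.3),

$$P\{R_2 = k\} = \int_0^1 \binom{n}{k} x^k (1 - x)^{n-k}\,dx$$

This is an instance of the beta function, defined by

$$\beta(r, s) = \int_0^1 x^{r-1} (1 - x)^{s-1}\,dx, \qquad r, s > 0$$

It can be shown that the beta function can be expressed in terms of the gamma
function [see (3.2.2)] by

$$\beta(r, s) = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r + s)} \tag{4.2.1}$$

(see Problem 1). Thus

$$p_2(k) = \binom{n}{k} \beta(k + 1, n - k + 1) = \binom{n}{k} \frac{\Gamma(k + 1)\Gamma(n - k + 1)}{\Gamma(n + 2)}$$

$$= \frac{n!}{k!\,(n - k)!} \cdot \frac{k!\,(n - k)!}{(n + 1)!} = \frac{1}{n + 1}, \qquad k = 0, 1, \ldots, n$$ ◄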
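
The conclusion that every value $k = 0, 1, \ldots, n$ is equally likely can be verified without the gamma function, by expanding $(1 - x)^{n-k}$ with the binomial theorem and integrating term by term. The sketch below does this in exact rational arithmetic; the function name `prob_heads` and the range of $n$ checked are choices made for the illustration.

```python
from fractions import Fraction
from math import comb

def prob_heads(n, k):
    """P{R_2 = k}: the integral over [0, 1] of C(n,k) x^k (1 - x)^(n-k), computed
    exactly by expanding (1 - x)^(n-k) and integrating each power of x."""
    integral = sum(Fraction((-1) ** j * comb(n - k, j), k + j + 1)
                   for j in range(n - k + 1))
    return comb(n, k) * integral

for n in (1, 2, 5, 8):
    # every entry in the printed list should equal 1/(n + 1)
    print(n, [prob_heads(n, k) for k in range(n + 1)])
```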
▶ Example 2. A nonnegative number $R_1$ is chosen with the density
$f_1(x) = xe^{-x}$, $x \ge 0$; $f_1(x) = 0$, $x < 0$. If $R_1 = x$, a number $R_2$ is chosen with uniform
density between 0 and $x$. Find $P\{R_1 + R_2 \le 2\}$.

Now we must have $0 \le R_2 \le R_1$; hence, if $0 \le R_1 \le 1$, then necessarily
$R_1 + R_2 \le 2$. If $1 < R_1 \le 2$, then $R_1 + R_2 \le 2$ provided that
$R_2 \le 2 - R_1$. If $R_1 > 2$, then $R_1 + R_2$ cannot be $\le 2$. By (4.1.2),

$$P\{R_1 + R_2 \le 2\} = \int_0^\infty xe^{-x} P\{R_1 + R_2 \le 2 \mid R_1 = x\}\,dx$$

$$= \int_0^1 xe^{-x}(1)\,dx + \int_1^2 xe^{-x} P\{R_2 \le 2 - x \mid R_1 = x\}\,dx + \int_2^\infty xe^{-x}(0)\,dx$$

Given $R_1 = x$, $R_2$ is uniformly distributed between 0 and $x$; thus

$$P\{R_2 \le 2 - x \mid R_1 = x\} = \frac{2 - x}{x}, \qquad 1 \le x \le 2$$

(see Figure 4.2.1). Therefore

$$P\{R_1 + R_2 \le 2\} = \int_0^1 xe^{-x}\,dx + \int_1^2 xe^{-x}\left(\frac{2 - x}{x}\right)dx = 1 - 2e^{-1} + e^{-2}$$ ◄
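
The value $1 - 2e^{-1} + e^{-2} \approx .40$ can be checked two ways: by integrating the conditional probability against $f_1$ numerically, and by simulating the two-stage experiment. The sketch below is an illustration only. To draw $R_1$ it uses the standard fact (not stated in the text) that the sum of two independent exponential variables with mean 1 has density $xe^{-x}$; the function names, step count, run count, and seed are assumptions.

```python
import math
import random

def cond_prob(x):
    """P{R_1 + R_2 <= 2 | R_1 = x} when R_2 is uniform on (0, x)."""
    if x <= 1.0:
        return 1.0
    if x <= 2.0:
        return (2.0 - x) / x
    return 0.0

def prob_by_integration(steps=20000):
    """Midpoint rule for the integral of x e^{-x} cond_prob(x) over [0, 2]."""
    h = 2.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * math.exp(-x) * cond_prob(x)
    return total * h

def prob_by_simulation(runs, rng):
    """Draw R_1 with density x e^{-x}, then R_2 uniform on (0, R_1)."""
    hits = 0
    for _ in range(runs):
        r1 = rng.expovariate(1.0) + rng.expovariate(1.0)
        r2 = rng.uniform(0.0, r1)
        if r1 + r2 <= 2.0:
            hits += 1
    return hits / runs

exact = 1.0 - 2.0 * math.exp(-1.0) + math.exp(-2.0)
numeric = prob_by_integration()
simulated = prob_by_simulation(100000, random.Random(3))
print(exact, numeric, simulated)
```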


▶ Example 3. Let $R_1$ be a discrete random variable, taking on the values
$x_1, x_2, \ldots$ with probabilities $p(x_1), p(x_2), \ldots$. If $R_1 = x_i$, a random variable
$R_2$ is observed, where $R_2$ has density $f_i$. What is $P\{(R_1, R_2) \in C\}$?

This is not quite the situation we considered in Section 4.1, since $R_1$ is
discrete. However, the theorem of total probability should still be in force.
$R_1$ takes the value $x_i$ with probability $p(x_i)$; given that $R_1 = x_i$, the probability
that $R_2 \in B$ is $P_{x_i}(B) = \int_B f_i(y)\,dy$. Thus we should have

$$P\{R_1 \in A, R_2 \in B\} = \sum_{x_i \in A} p(x_i) \int_B f_i(y)\,dy \tag{4.2.2}$$

and, more generally,

$$P\{(R_1, R_2) \in C\} = \sum_i p(x_i) \int_{C_{x_i}} f_i(y)\,dy \tag{4.2.3}$$

FIGURE 4.2.1 Conditional Probability Calculation.

In fact, if we take $\Omega = E^2$, $\mathscr{F} =$ Borel sets, $R_1(x, y) = x$, $R_2(x, y) = y$,
it turns out that there is a unique probability measure $P$ on $\mathscr{F}$ satisfying (4.2.2)
for all Borel subsets $A$, $B$ of $E^1$; $P$ is given explicitly by (4.2.3). ◄


PROBLEMS 


1. Derive formula (4.2.1). HINT: in $\Gamma(r) = \int_0^\infty t^{r-1}e^{-t}\,dt$, let $t = x^2$. Then write
$\Gamma(r)\Gamma(s)$ as a double integral and switch to polar coordinates.

2. In Example 2, what are the sets $C$ and $C_x$ in (4.1.2)? What is $P_x(C_x)$?

3. In Example 3, suppose that $R_1$ takes on positive integer values $1, 2, \ldots$ with
probabilities $p_1, p_2, \ldots$ ($p_i \ge 0$, $\sum_{i=1}^\infty p_i = 1$). If $R_1 = n$, $R_2$ is selected according
to the density $f_n(x) = ne^{-nx}$, $x \ge 0$; $f_n(x) = 0$, $x < 0$. Find the probability
that $4 \le R_1 + R_2 \le 6$.

4. In Example 3 we specified $P_{x_i}(B)$ to be interpreted intuitively as the probability
that $R_2 \in B$, given that $R_1 = x_i$. This, plus the specification of $p(x_i)$, $i = 1, 2, \ldots$,
determines the probability measure $P$. Use (4.2.2) to show that if $p(x_i) > 0$ then
$P\{R_2 \in B \mid R_1 = x_i\} = P_{x_i}(B)$, thus justifying the intuition. In other words, the
conditional probability as computed from the probability measure $P$ coincides
with the original specification.

5. A number $R_1$ is chosen with density $f_1(x) = 1/x^2$, $x \ge 1$; $f_1(x) = 0$, $x < 1$. If
$R_1 = x$, let $R_2$ be uniformly distributed between 0 and $x$. Find the distribution
and density functions of $R_2$.


4.3 CONDITIONAL DENSITY FUNCTIONS

We have seen that specification of the distribution or density function of a
random variable $R_1$, together with $P_x(B)$ (for all real $x$ and Borel subsets $B$
of $E^1$), interpreted intuitively as the conditional probability that $R_2 \in B$,
given $R_1 = x$, determines the probability of all events of the form
$\{(R_1, R_2) \in C\}$. However, this has not resolved the difficulty of defining conditional
probabilities given events of probability 0. If we are given random variables
$R_1$ and $R_2$ with a particular joint distribution function, we can ask whether
it is possible to define in a meaningful way the conditional probability
$P\{R_2 \in B \mid R_1 = x\}$, even though the event $\{R_1 = x\}$ may have probability
0 for some, in fact perhaps for all, $x$. We now consider this question in the
case in which $R_1$ and $R_2$ have a joint density $f$.

A reasonable approach to the conditional probability $P\{R_2 \in B \mid R_1 = x_0\}$
is to look at $P\{R_2 \in B \mid x_0 - h < R_1 < x_0 + h\}$ and let $h \to 0$. Now

$$P\{x_0 - h < R_1 < x_0 + h,\ R_2 \in B\} = \int_{x_0 - h}^{x_0 + h} \left[\int_B f(x, y)\,dy\right] dx$$

which for small $h$ should look like $2h \int_B f(x_0, y)\,dy$. But
$P\{x_0 - h < R_1 < x_0 + h\}$ looks like $2h f_1(x_0)$ for small $h$, where $f_1(x) = \int_{-\infty}^\infty f(x, y)\,dy$ is the
density of $R_1$. Thus, as $h \to 0$, it appears that under appropriate conditions
$P\{R_2 \in B \mid x_0 - h < R_1 < x_0 + h\}$ should approach $\int_B [f(x_0, y)/f_1(x_0)]\,dy$, so
that we find conditional probabilities involving $R_2$, given $R_1 = x_0$, by integrating
$f(x_0, y)/f_1(x_0)$ with respect to $y$.

We are led to define the conditional density of $R_2$ given $R_1 = x$ (or, for short,
the conditional density of $R_2$ given $R_1$) as

$$h(y \mid x) = \frac{f(x, y)}{f_1(x)} \tag{4.3.1}$$

Since $\int_{-\infty}^\infty f(x, y)\,dy = f_1(x)$ (see Section 2.7), we have $\int_{-\infty}^\infty h(y \mid x)\,dy = 1$,
so that $h(y \mid x)$, regarded as a function of $y$, is a legitimate density.

Notice that the conditional density is defined only when $f_1(x) > 0$. However,
we may essentially ignore those $(x, y)$ at which the conditional density
is not defined. For let $S = \{(x, y): f_1(x) = 0\}$. We can show that
$P\{(R_1, R_2) \in S\} = 0$.

$$P\{(R_1, R_2) \in S\} = \iint_S f(x, y)\,dx\,dy = \int_{\{x: f_1(x) = 0\}} \left[\int_{-\infty}^\infty f(x, y)\,dy\right] dx$$

$$= \int_{\{x: f_1(x) = 0\}} f_1(x)\,dx = 0$$

We define the conditional probability that $R_2$ belongs to the Borel set $B$,
given that $R_1 = x$, as

$$P_x(B) = P\{R_2 \in B \mid R_1 = x\} = \int_B h(y \mid x)\,dy \tag{4.3.2}$$

We can ask whether this is a sensible definition of conditional probability.
We have set up our own ground rules to answer this question: "sensible"
means that the theorem of total probability holds. Let us check that in fact
(4.1.1) [and hence (4.1.2)] holds. We have

$$P\{R_1 \in A, R_2 \in B\} = \int_{x \in A} \int_{y \in B} f(x, y)\,dy\,dx$$

$$= \int_{x \in A} f_1(x) \left[\int_{y \in B} h(y \mid x)\,dy\right] dx = \int_A P_x(B) f_1(x)\,dx$$

which is (4.1.1).
We have seen that if $(R_1, R_2)$ has density $f(x, y)$ and $R_1$ has density $f_1(x)$
we have a conditional density $h(y \mid x) = f(x, y)/f_1(x)$ for $R_2$, given $R_1 = x$.
Let us reverse this process. Suppose that we observe a random variable $R_1$
with density $f_1(x)$; if $R_1 = x$, we observe a random variable $R_2$ with density
$h(y \mid x)$. If we accept the continuous version of the theorem of total probability,
we may calculate the joint distribution function of $R_1$ and $R_2$ using
(4.1.1).

$$F(x_0, y_0) = P\{R_1 \le x_0, R_2 \le y_0\} = \int_{-\infty}^{x_0} P\{R_2 \le y_0 \mid R_1 = x\} f_1(x)\,dx$$

$$= \int_{-\infty}^{x_0} \left[\int_{-\infty}^{y_0} h(y \mid x)\,dy\right] f_1(x)\,dx = \int_{-\infty}^{x_0} \int_{-\infty}^{y_0} f_1(x)h(y \mid x)\,dy\,dx$$

Thus $(R_1, R_2)$ has a density given by $f(x, y) = f_1(x)h(y \mid x)$, in agreement with
(4.3.1). We may look at the formula $f(x, y) = f_1(x)h(y \mid x)$ in two
ways.

1. If $(R_1, R_2)$ has density $f(x, y)$, we have a natural notion of conditional
probability.

$$P_x(B) = P\{R_2 \in B \mid R_1 = x\} = \int_B h(y \mid x)\,dy$$

2. If $R_1$ has density $f_1(x)$, and whenever $R_1 = x$ we select $R_2$ with density
$h(y \mid x)$, then in the natural formulation of this problem $(R_1, R_2)$ has density
$f(x, y) = f_1(x)h(y \mid x)$.

In both cases "natural" indicates that (4.1.1), the continuous version of the
theorem of total probability, is required to hold.

We may extend these results to higher dimensions. For example, if
$(R_1, R_2, R_3, R_4)$ has density $f(x_1, x_2, x_3, x_4)$, we define (say) the conditional density of
$(R_3, R_4)$ given $(R_1, R_2)$, as

$$h(x_3, x_4 \mid x_1, x_2) = \frac{f(x_1, x_2, x_3, x_4)}{f_{12}(x_1, x_2)}$$

where

$$f_{12}(x_1, x_2) = \int_{-\infty}^\infty \int_{-\infty}^\infty f(x_1, x_2, x_3, x_4)\,dx_3\,dx_4$$

The conditional probability that $(R_3, R_4)$ belongs to the two-dimensional
Borel set $B$, given that $R_1 = x_1$, $R_2 = x_2$, is defined by

$$P_{x_1x_2}(B) = P\{(R_3, R_4) \in B \mid R_1 = x_1, R_2 = x_2\}$$

$$= \iint_B h(x_3, x_4 \mid x_1, x_2)\,dx_3\,dx_4$$

The appropriate version of the theorem of total probability is

$$P\{(R_1, R_2) \in A,\ (R_3, R_4) \in B\} = \iint_A P_{x_1x_2}(B) f_{12}(x_1, x_2)\,dx_1\,dx_2$$

If $(R_1, R_2)$ has density $f_{12}(x_1, x_2)$, and having observed $R_1 = x_1$, $R_2 = x_2$,
we select $(R_3, R_4)$ with density $h(x_3, x_4 \mid x_1, x_2)$, then $(R_1, R_2, R_3, R_4)$ must
have density $f(x_1, x_2, x_3, x_4) = f_{12}(x_1, x_2)h(x_3, x_4 \mid x_1, x_2)$.

Let us do some examples.


▶ Example 1. We arrive at a bus stop at time $t = 0$. Two buses $A$ and $B$
are in operation. The arrival time $R_1$ of bus $A$ is uniformly distributed between
0 and $t_A$ minutes, and the arrival time $R_2$ of bus $B$ is uniformly distributed
between 0 and $t_B$ minutes, with $t_A \le t_B$. The arrival times are
independent. Find the probability that bus $A$ will arrive first.

We are looking for the probability that $R_1 < R_2$. Since $R_1$ and $R_2$ are
independent (and have a joint density), the conditional density of $R_2$ given
$R_1$ is

$$h(y \mid x) = \frac{f(x, y)}{f_1(x)} = f_2(y) = \frac{1}{t_B}, \qquad 0 \le y \le t_B$$

If bus $A$ arrives at $x$, $0 \le x \le t_A$, it will be first provided that bus $B$ arrives
between $x$ and $t_B$. This happens with probability $(t_B - x)/t_B$. Thus

$$P\{R_1 < R_2 \mid R_1 = x\} = 1 - \frac{x}{t_B}, \qquad 0 \le x \le t_A$$

By (4.1.2),

$$P\{R_1 < R_2\} = \int_{-\infty}^\infty P\{R_1 < R_2 \mid R_1 = x\} f_1(x)\,dx$$

$$= \int_0^{t_A} \left(1 - \frac{x}{t_B}\right) \frac{1}{t_A}\,dx = 1 - \frac{t_A}{2t_B}$$

[Formally, taking the sample space as $E^2$, we have $C = \{R_1 < R_2\} = \{(x, y): x < y\}$,
$C_x = \{y: x < y\}$, $P_x(C_x) = P\{R_1 < R_2 \mid R_1 = x\} = 1 - x/t_B$, $0 \le x \le t_A$.]

Alternatively, we may simply use the joint density:

$$P\{R_1 < R_2\} = \iint_{x < y} f(x, y)\,dx\,dy$$

= the shaded area in Figure 4.3.1, divided by the total area $t_At_B$

$$= 1 - \frac{t_A^2/2}{t_At_B} = 1 - \frac{t_A}{2t_B}$$

as before. ◄

FIGURE 4.3.1 Bus Problem.
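
The bus problem lends itself to a quick simulation. The sketch below is an illustration only; the values $t_A = 10$ and $t_B = 15$ minutes, the run count, the seed, and the function names are assumptions. It draws the two independent uniform arrival times and compares the observed frequency of "$A$ first" with $1 - t_A/2t_B$.

```python
import random

def prob_a_first_formula(t_a, t_b):
    """1 - t_A/(2 t_B), valid when t_A <= t_B."""
    return 1.0 - t_a / (2.0 * t_b)

def prob_a_first_simulated(t_a, t_b, runs, rng):
    """Fraction of simulated days on which bus A arrives before bus B."""
    wins = 0
    for _ in range(runs):
        if rng.uniform(0.0, t_a) < rng.uniform(0.0, t_b):
            wins += 1
    return wins / runs

t_a, t_b = 10.0, 15.0                 # hypothetical values with t_A <= t_B
formula = prob_a_first_formula(t_a, t_b)
simulated = prob_a_first_simulated(t_a, t_b, 100000, random.Random(4))
print(formula, simulated)
```

With $t_A = t_B$ the formula gives $1/2$, as symmetry demands.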


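▶ Example 2. Let $R_0$ be a nonnegative random variable with density
$f_0(\lambda) = e^{-\lambda}$, $\lambda \ge 0$. If $R_0 = \lambda$, we take $n$ independent observations $R_1$,
$R_2, \ldots, R_n$, each $R_i$ having the exponential density $f_\lambda(y) = \lambda e^{-\lambda y}$, $y \ge 0$
($= 0$ for $y < 0$). Find the conditional density of $R_0$ given $(R_1, R_2, \ldots, R_n)$.

Here we have specified $f_0(\lambda)$, the density of $R_0$, and the conditional density
of $(R_1, R_2, \ldots, R_n)$ given $R_0$, namely,

$$h(x_1, x_2, \ldots, x_n \mid \lambda) = f_\lambda(x_1) f_\lambda(x_2) \cdots f_\lambda(x_n) \quad \text{by the independence assumption}$$

$$= \lambda^n e^{-\lambda x}, \qquad x = \sum_{i=1}^n x_i$$

The joint density of $R_0, R_1, \ldots, R_n$ is therefore

$$f(\lambda, x_1, \ldots, x_n) = f_0(\lambda) h(x_1, \ldots, x_n \mid \lambda) = \lambda^n e^{-\lambda(1 + x)}$$

The joint density of $R_1, \ldots, R_n$ is given by

$$g(x_1, \ldots, x_n) = \int_0^\infty f(\lambda, x_1, \ldots, x_n)\,d\lambda = \int_0^\infty \lambda^n e^{-\lambda(1 + x)}\,d\lambda$$

$$= (\text{with } y = \lambda(1 + x)) \quad \frac{1}{(1 + x)^{n+1}} \int_0^\infty y^n e^{-y}\,dy = \frac{n!}{(1 + x)^{n+1}}$$

Thus the conditional density of $R_0$ given $(R_1, \ldots, R_n)$ is

$$h(\lambda \mid x_1, \ldots, x_n) = \frac{f(\lambda, x_1, \ldots, x_n)}{g(x_1, \ldots, x_n)} = \frac{1}{n!}\, \lambda^n e^{-\lambda(1 + x)} (1 + x)^{n+1},$$

$$\lambda, x_1, \ldots, x_n \ge 0$$ ◄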
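
As a sanity check on Example 2, the conditional density should integrate to 1 in $\lambda$ for any fixed observations. The sketch below verifies this numerically for one hypothetical set of observations; the data values, integration range, and function names are assumptions made for the illustration. It also computes the mean of the conditional density, which for this gamma-type density is $(n + 1)/(1 + x)$, a standard fact that the text does not state.

```python
import math

def conditional_density(lam, xs):
    """h(lambda | x_1, ..., x_n) from Example 2, with x = x_1 + ... + x_n."""
    n = len(xs)
    x = sum(xs)
    if lam < 0:
        return 0.0
    return (lam ** n * math.exp(-lam * (1.0 + x)) * (1.0 + x) ** (n + 1)
            / math.factorial(n))

def integrate(func, upper, steps=20000):
    """Midpoint rule on [0, upper]; the tail beyond `upper` is assumed negligible."""
    h = upper / steps
    return sum(func((i + 0.5) * h) for i in range(steps)) * h

xs = [0.5, 1.2, 0.8]                  # hypothetical observations, n = 3
total = integrate(lambda lam: conditional_density(lam, xs), 20.0)
mean = integrate(lambda lam: lam * conditional_density(lam, xs), 20.0)
print(total, mean, (len(xs) + 1) / (1.0 + sum(xs)))
```

Larger observed values $x_i$ pull the conditional density of $R_0$ toward smaller $\lambda$, as one would expect when the observations have mean $1/\lambda$.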


PROBLEMS 


1. Let $(R_1, R_2)$ have density $f(x, y) = e^{-y}$, $0 \le x \le y$, $f(x, y) = 0$ elsewhere.
Find the conditional density of $R_2$ given $R_1$, and $P\{R_2 \le y \mid R_1 = x\}$, the
conditional distribution function of $R_2$ given $R_1 = x$.
2. Let $(R_1, R_2)$ have density $f(x, y) = k|x|$, $-1 \le x \le 1$, $-1 \le y \le x$; $f(x, y) = 0$
elsewhere. Find $k$; also find the individual densities of $R_1$ and $R_2$, the conditional
density of $R_2$ given $R_1$, and the conditional density of $R_1$ given $R_2$.

3. (a) If $(R_1, R_2)$ is uniformly distributed over the set $C = \{(x, y): x^2 + y^2 \le 1\}$,
show that, given $R_1 = x$, $R_2$ is uniformly distributed between $-(1 - x^2)^{1/2}$
and $+(1 - x^2)^{1/2}$.

(b) Let $(R_1, R_2)$ be uniformly distributed over the arbitrary two-dimensional
Borel set $C$ [i.e., $P(B) = (\text{area of } B \cap C)/\text{area of } C$ ($= \text{area } B/\text{area } C$ if
$B \subset C$)].

Show that given $R_1 = x$, $R_2$ is uniformly distributed on $C_x = \{y: (x, y) \in C\}$.

In other words, $h(y \mid x)$ is constant for $y \in C_x$, and 0 for $y \notin C_x$.

4. In Problem 1, let $R_3 = R_2 - R_1$. Find the conditional density of $R_3$ given $R_1 = x$.
Also find $P\{1 \le R_3 \le 2 \mid R_1 = x\}$.

5. Suppose that $(R_1, R_2)$ has density $f$ and $R_3 = g(R_1, R_2)$. You are asked to
compute the conditional distribution function of $R_3$, given $R_1 = x$; that is,
$P\{R_3 \le z \mid R_1 = x\}$. How would you go about it?


4.4 CONDITIONAL EXPECTATION 


In the preceding sections we considered situations in which two successive 
observations are made, the second observation depending on the result of 
the first. The essential ingredient in such problems is the quantity P,(B), 
defined for real x and Borel sets B, to be interpreted as the conditional prob- 
ability that the second observation will fall into B, given that the first observa- 
tion takes the value x: for short, P{R, € B| R, = x}. In particular, we may 
define the conditional distribution function of R, given R, = x, as F,(y | 2) = 
P{R, < y| R, = 5}. 

If $R_1$ and $R_2$ have a joint density, this can be computed from the
conditional density $h$ of $R_2$ given $R_1$:

\[ F_2(y_0 \mid x) = \int_{-\infty}^{y_0} h(y \mid x)\,dy \]

In any case, for each real $x$ we have a probability measure $P_x$, defined on the
Borel subsets of $E^1$. Now if $R_1 = x$ and we observe $R_2$, there should be an
average value associated with $R_2$, that is, a conditional expectation of $R_2$
given that $R_1 = x$. How should this be computed? Let us try to set up an
appropriate model. We are observing a single random variable $R_2$, so let
$\Omega = E^1$, $\mathcal{F}$ = Borel sets, $R_2(y) = y$. We are not concerned with the
probability that $R_2 \in B$, but instead with the probability that $R_2 \in B$, given
that $R_1 = x$. In other words, the appropriate probability measure is $P_x$.
The expectation of $R_2$, computed with respect to $P_x$, is called the conditional
expectation of $R_2$ given that $R_1 = x$ (or, for short, the conditional expectation
of $R_2$ given $R_1$), written $E(R_2 \mid R_1 = x)$.

Note that if $g$ is a (piecewise continuous) function from $E^1$ to $E^1$, then $g(R_2)$
is also a random variable (see Section 2.7), so that we may also talk about
the conditional expectation of $g(R_2)$ given $R_1 = x$, written $E[g(R_2) \mid R_1 = x]$.
In particular, if there is a conditional density $h$ of $R_2$ given $R_1 = x$, then,
by Theorem 2 of Section 3.1,

\[ E[g(R_2) \mid R_1 = x] = \int_{-\infty}^{\infty} g(y)h(y \mid x)\,dy \tag{4.4.1} \]

if $g \ge 0$ or if the integral is absolutely convergent.

There is an immediate extension to $n$ dimensions. For example, if there is a
conditional density of $(R_4, R_5)$ given $(R_1, R_2, R_3)$, then

\[ E[g(R_4, R_5) \mid R_1 = x_1, R_2 = x_2, R_3 = x_3]
   = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x_4, x_5)h(x_4, x_5 \mid x_1, x_2, x_3)\,dx_4\,dx_5 \]


Note also that conditional probability can be obtained from conditional
expectation. If in (4.4.1) we take $g(y) = I_B(y) = 1$ if $y \in B$, and $= 0$ if
$y \notin B$, then

\[ E[g(R_2) \mid R_1 = x] = E[I_B(R_2) \mid R_1 = x] = \int_{-\infty}^{\infty} I_B(y)h(y \mid x)\,dy
   = \int_B h(y \mid x)\,dy = P\{R_2 \in B \mid R_1 = x\} \]

We have seen previously that $P\{R_2 \in B\} = E[I_{\{R_2 \in B\}}]$. We now have a
similar result under the condition that $R_1 = x$. [Notice that $I_B(R_2) = I_{\{R_2 \in B\}}$,
for $I_B(R_2(\omega)) = 1$ iff $R_2(\omega) \in B$, that is, iff $I_{\{R_2 \in B\}}(\omega) = 1$.]

Let us consider again the examples of Section 4.2. 


▶ Example 1. $R_1$ is uniformly distributed between 0 and 1; if $R_1 = x$,
$R_2$ is the number of heads in $n$ tosses of a coin with probability $x$ of heads.

Given that $R_1 = x$, $R_2$ has a binomial distribution with parameters $n$ and $x$:
$P\{R_2 = k \mid R_1 = x\} = \binom{n}{k}x^k(1 - x)^{n-k}$. It follows that $E(R_2 \mid R_1 = x)$ is the
average number of successes in $n$ Bernoulli trials, with probability $x$ of
success on a particular trial, namely, $nx$. ◀


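▶ Example 2. $R_1$ has density $f_1(x) = xe^{-x}$, $x \ge 0$; $f_1(x) = 0$, $x < 0$. The
conditional density of $R_2$ given $R_1 = x$ is uniform over $[0, x]$. It follows
that, for $x > 0$,

\[ E(R_2 \mid R_1 = x) = \int_{-\infty}^{\infty} y\,h(y \mid x)\,dy = \int_0^x y\,\frac{1}{x}\,dy = \frac{x}{2} \]

Similarly,

\[ E[e^{R_2} \mid R_1 = x] = \int_{-\infty}^{\infty} e^{y}h(y \mid x)\,dy = \int_0^x e^{y}\,\frac{1}{x}\,dy = \frac{e^x - 1}{x} \]
◀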
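
As a rough numerical check of Example 2, the sketch below approximates the integral in (4.4.1) by a midpoint rule for $g(y) = y$ and $g(y) = e^y$, with the uniform conditional density $h(y \mid x) = 1/x$ on $[0, x]$. The function name `cond_expectation`, the step count, and the test point $x = 2$ are illustrative choices, not part of the text.

```python
import math

def cond_expectation(g, x, steps=10_000):
    """Midpoint-rule approximation of E[g(R2) | R1 = x] when,
    given R1 = x, R2 is uniform on [0, x] (so h(y | x) = 1/x there)."""
    width = x / steps
    total = sum(g((j + 0.5) * width) for j in range(steps))
    return total * width / x

x = 2.0  # arbitrary positive test point
mean_r2 = cond_expectation(lambda y: y, x)   # closed form: x / 2
mean_exp = cond_expectation(math.exp, x)     # closed form: (e^x - 1) / x
```

Both approximations should agree with the closed forms $x/2$ and $(e^x - 1)/x$ up to the quadrature error.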
▶ Example 3. $R_1$ is discrete, with $p(x_i) = P\{R_1 = x_i\}$, $i = 1, 2, \ldots$. Given
$R_1 = x_i$, $R_2$ has density $f_i$; that is,

\[ P\{R_2 \in B \mid R_1 = x_i\} = \int_B f_i(y)\,dy \]

Thus

\[ E[g(R_2) \mid R_1 = x_i] = \int_{-\infty}^{\infty} g(y)f_i(y)\,dy \]
◀


Now let us consider a slightly different case. 


▶ Example 4. Let $R_1$ and $R_2$ be discrete random variables. If $R_1 = x$, then
$R_2$ will take the value $y$ with probability

\[ p(y \mid x) = P\{R_2 = y \mid R_1 = x\} = \frac{p_{12}(x, y)}{p_1(x)} \]

where

\[ p_{12}(x, y) = P\{R_1 = x, R_2 = y\}, \qquad p_1(x) = P\{R_1 = x\} \]

$p(y \mid x)$, which is defined provided that $p_1(x) > 0$, will be called the
conditional probability function of $R_2$ given $R_1 = x$ (or the conditional
probability function of $R_2$ given $R_1$, for short). We may find the probability that
$R_2 \in B$ given $R_1 = x$ by summing the conditional probability function:

\[ P_x(B) = P\{R_2 \in B \mid R_1 = x\} = \frac{P\{R_1 = x, R_2 \in B\}}{P\{R_1 = x\}}
   = \sum_{y \in B} \frac{p_{12}(x, y)}{p_1(x)} = \sum_{y \in B} p(y \mid x) \]

Thus, given that $R_1 = x$, the probabilities of events involving $R_2$ are found
from the probability function $p(y \mid x)$, $y$ real. Therefore the conditional
expectation of $g(R_2)$ given $R_1 = x$ is

\[ E[g(R_2) \mid R_1 = x] = \sum_y g(y)p(y \mid x) \tag{4.4.2} \]

In particular,

\[ E(R_2 \mid R_1 = x) = \sum_y y\,p(y \mid x) \]
◀


There is a feature common to all these examples. In each case the
expectation of $R_2$ (or of a function of $R_2$) can be expressed as a weighted average
of conditional expectations. Let us look at Example 4 first. With probability
$p_1(x)$, $R_1$ takes the value $x$; if $R_1 = x$, the average value of $R_2$ is
$E(R_2 \mid R_1 = x)$. By analogy with the theorem of total probability, it is reasonable to
expect that

\[ E(R_2) = \sum_x p_1(x)E(R_2 \mid R_1 = x) \]
To justify this, write

\[ E(R_2) = \sum_y y\,p_2(y) = \sum_y y\,P\{R_2 = y\} = \sum_y y \sum_x P\{R_1 = x, R_2 = y\} \quad \text{by (2.7.2)} \]
\[ \phantom{E(R_2)} = \sum_{x, y} y\,P\{R_1 = x\}P\{R_2 = y \mid R_1 = x\} = \sum_x p_1(x)\Big[\sum_y y\,p(y \mid x)\Big] \]


This is the desired result. 

In Example 1 the probability that $R_1$ will lie in an interval $dx$ about $x$ is
$f_1(x)\,dx = dx$; given that $R_1 = x$, the average value of $R_2$ is $E(R_2 \mid R_1 = x) = nx$.
We expect that

\[ E(R_2) = \int_{-\infty}^{\infty} f_1(x)E(R_2 \mid R_1 = x)\,dx \]


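To verify this, notice that we calculated in Section 4.2 that

\[ P\{R_2 = k\} = \frac{1}{n + 1}, \qquad k = 0, 1, \ldots, n \]

Thus

\[ E(R_2) = \sum_{k=0}^{n} k\,P\{R_2 = k\} = \frac{1}{n + 1}(1 + 2 + \cdots + n) = \frac{1}{n + 1}\cdot\frac{n(n + 1)}{2} = \frac{n}{2} \]

But

\[ \int_{-\infty}^{\infty} f_1(x)E(R_2 \mid R_1 = x)\,dx = \int_0^1 nx\,dx = \frac{n}{2} \]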

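In Example 2, the joint density of $R_1$ and $R_2$ is

\[ f(x, y) = f_1(x)h(y \mid x) = \frac{xe^{-x}}{x} = e^{-x}, \qquad x \ge 0,\ 0 \le y \le x \]

Now

\[ E(R_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(x, y)\,dx\,dy \]

[Notice that we need not compute $f_2(y)$ explicitly; instead we simply regard
$R_2$ as a function of $R_1$ and $R_2$; that is, we set $g(R_1, R_2) = R_2$ and compute

\[ E[g(R_1, R_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)f(x, y)\,dx\,dy.] \]

Thus

\[ E(R_2) = \int_0^{\infty} e^{-x}\Big[\int_0^x y\,dy\Big]dx = \int_0^{\infty} \tfrac{1}{2}x^2e^{-x}\,dx = \tfrac{1}{2}\Gamma(3) = 1 \]

But

\[ \int_{-\infty}^{\infty} f_1(x)E(R_2 \mid R_1 = x)\,dx = \int_0^{\infty} xe^{-x}\big(\tfrac{1}{2}x\big)\,dx = 1 \]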
In Example 3 we have [see (4.2.2)]

\[ P\{R_2 \in B\} = \sum_i p(x_i)\int_B f_i(y)\,dy = \int_B \Big[\sum_i p(x_i)f_i(y)\Big]dy \]

so that $R_2$ has density

\[ f_2(y) = \sum_i p(x_i)f_i(y) \tag{4.4.3} \]

Thus

\[ E(R_2) = \int_{-\infty}^{\infty} y f_2(y)\,dy = \sum_i p(x_i)\int_{-\infty}^{\infty} y f_i(y)\,dy \]

and consequently

\[ E(R_2) = \sum_i p(x_i)E(R_2 \mid R_1 = x_i) \]

as expected.
Results of the form

\[ E(R_2) = \sum_i p(x_i)E(R_2 \mid R_1 = x_i) \tag{4.4.4} \]

or

\[ E(R_2) = \int_{-\infty}^{\infty} f_1(x)E(R_2 \mid R_1 = x)\,dx \tag{4.4.5} \]

are called versions of the theorem of total expectation.

In the situations we are considering, conditional expectations are derived
ultimately from a given set of probabilities $P_x(B) = P\{R_2 \in B \mid R_1 = x\}$.
In such cases it turns out that if $E(R_2)$ exists, (4.4.4) will hold if $R_1$ is discrete,
and (4.4.5) will hold if $R_1$ is absolutely continuous.

Notice that $E(R_2 \mid R_1 = x)$ will in general depend on $x$ and hence may be
written as $g(x)$; $\int_{-\infty}^{\infty} g(x)f_1(x)\,dx$ in (4.4.5) [or $\sum_i g(x_i)p(x_i)$ in (4.4.4)] is then
the expectation of $g(R_1)$. Thus (4.4.4) and (4.4.5) may be rephrased as
follows.

The expectation of the conditional expectation of $R_2$ given $R_1$ is the (over-all)
expectation of $R_2$.


▶ Example 5. Let $R$ be a random variable with the distribution function
shown in Figure 4.4.1. Find $E(R^3)$.

[FIGURE 4.4.1 Graph of the distribution function $F(x)$; x-axis ticks at $-1$, 0, 1, 2, 3.]
If $R$ were discrete we would compute

\[ E(R^3) = \sum_x x^3 p_R(x) \]

and if $R$ were absolutely continuous we would compute

\[ E(R^3) = \int_{-\infty}^{\infty} x^3 f_R(x)\,dx \]

In this case, however, $R$ falls into neither category. We are going to show how
to use the theorem of total expectation to compute $E(R^3)$.

We have $P\{R = -1\} = 1/4$, $P\{R = 2\} = 3/4 - 1/4 = 1/2$, $P\{R = x\} = 0$
for other values of $x$. Let $F_1$ be a step function that is 0 for $x < -1$ and has a
jump of 1/4 at $x = -1$ and a jump of 1/2 at $x = 2$. Subtract $F_1$ from $F$ to
obtain a continuous function $F_2$ that can be represented as an integral of a
nonnegative function $f_2$. $F_1$ is called the "discrete part" of $F$, and $F_2$ the
"absolutely continuous part" (see Figure 4.4.2). $F_1$ and $F_2$ are monotone,
right-continuous functions, and they approach zero as $x \to -\infty$. However,
they approach limits that are less than 1 as $x \to \infty$, so that they cannot be
regarded as distribution functions of random variables. On the other hand, $(4/3)F_1$
and $4F_2$ are legitimate distribution functions.

We shall show that

\[ E(R^3) = \sum_x x^3 p_R(x) + \int_{-\infty}^{\infty} x^3 f_2(x)\,dx \]


Consider the following random experiment. With probability 3/4 [$= F_1(\infty) = \sum_x p_R(x)$,
where $p_R(x) = P\{R = x\}$], pick a number in accordance
with $(4/3)F_1$; that is, pick $-1$ with probability 1/3 and 2 with probability
2/3. With probability 1/4 [$= F_2(\infty)$], pick a number in accordance with
$4F_2$, that is, one uniformly distributed between 2 and 3 (see Figure 4.4.3).

[FIGURE 4.4.2 Discrete and Absolutely Continuous Parts of a Distribution Function. Panels: $F_1(x)$, $F_2(x)$, $p_R(x) = P\{R = x\}$, and $f_2(x) = F_2'(x)$.]

[FIGURE 4.4.3 Tree Diagram for Example 5. Surviving branch labels: $N = -1$; $N$ uniformly distributed between 2 and 3.]

If $N$ is the resulting number, then, by the theorem of total probability,

\[ P\{N \le x\} = P(A)P\{N \le x \mid A\} + P(B)P\{N \le x \mid B\} \]


where $A$ and $B$ correspond to the two possible results at the first stage of the
experiment. Thus

\[ F_N(x) = \tfrac{3}{4}\big(\tfrac{4}{3}F_1(x)\big) + \tfrac{1}{4}\big(4F_2(x)\big) = F_1(x) + F_2(x) = F(x) \]

Therefore $F_N$ is the original distribution function $F$.
Since $N$ and $R$ have the same distribution function, we expect that $E(N^3) = E(R^3)$.
Now we may compute $E(N^3)$ by the theorem of total expectation.


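\[ E(N^3) = P(A)E(N^3 \mid A) + P(B)E(N^3 \mid B)
   = \tfrac{3}{4}\big[\tfrac{1}{3}(-1)^3 + \tfrac{2}{3}(2)^3\big] + \tfrac{1}{4}\int_2^3 x^3\,dx = \tfrac{15}{4} + \tfrac{65}{16} = \tfrac{125}{16} \]

Notice that this may be expressed as

\[ (-1)^3\tfrac{1}{4} + 2^3\,\tfrac{1}{2} + \int_2^3 x^3\,\tfrac{1}{4}\,dx \]

that is,

\[ E(R^3) = \sum_x x^3 p_R(x) + \int_{-\infty}^{\infty} x^3 f_2(x)\,dx \]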
More generally, the expectation of a function of $R$ may be computed by

\[ E[g(R)] = \sum_x g(x)p_R(x) + \int_{-\infty}^{\infty} g(x)f_2(x)\,dx\,† \tag{4.4.6} \]

if $g \ge 0$ or if both the series and the integral are absolutely convergent. ◀
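
Formula (4.4.6) can be evaluated exactly for the mixed distribution of Example 5. The sketch below assumes the data read off from the example (point masses 1/4 at $-1$ and 1/2 at 2, and $f_2 = 1/4$ on the interval from 2 to 3); the function name `moment` is an illustrative choice, and it handles only $g(x) = x^m$ for a nonnegative integer $m$.

```python
from fractions import Fraction

# Discrete part of F: point masses p_R(x).
atoms = {-1: Fraction(1, 4), 2: Fraction(1, 2)}
# Absolutely continuous part: f2(x) = 1/4 on (2, 3), zero elsewhere.
density_height, lo, hi = Fraction(1, 4), 2, 3

def moment(power):
    """E(R**power) via (4.4.6): sum over the atoms plus the integral against f2."""
    discrete = sum(Fraction(x) ** power * p for x, p in atoms.items())
    # Exact integral of x**power * f2(x) over (lo, hi).
    continuous = density_height * Fraction(hi ** (power + 1) - lo ** (power + 1), power + 1)
    return discrete + continuous

total_mass = moment(0)    # sanity check: the two parts together carry probability 1
third_moment = moment(3)  # E(R^3)
```

With these inputs the third moment works out to 125/16, the sum of $-1/4$, $4$, and $65/16$.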


▶ Example 6. Let $R$ be a random variable on a given probability space,
and $A$ an event with $P(A) > 0$. Formulate the proper definition of the
conditional expectation of $R$, given that $A$ has occurred.

This actually is not a new concept. If $I_A$ is the indicator of $A$, we are
looking for the expectation of $R$, given that $I_A = 1$. Let the experiment be
performed independently $n$ times, $n$ very large, and let $R_i$ be the value of $R$
obtained on trial $i$, $i = 1, 2, \ldots, n$. Renumber the trials so that $A$ occurs on
the first $k$ trials, and $A^c$ on the last $n - k$ [$k$ will be approximately $nP(A)$].
The average value of $R$, considering only those trials on which $A$ occurs, is

\[ \frac{R_1 + \cdots + R_k}{k} = \frac{n}{k}\Big(\frac{1}{n}\sum_{j=1}^{n} R_j I_j\Big) \]

where $I_j = 1$ if $A$ occurs on trial $j$; $I_j = 0$ if $A$ does not occur on trial
$j$. In other words, $I_j$ is simply the $j$th observation of $I_A$. It appears that
$(1/n)\sum_{j=1}^{n} R_j I_j$ approximates the expectation of $RI_A$; since $k/n$ approximates
$P(A)$, we are led to define the conditional expectation of $R$ given $A$ as

\[ E(R \mid A) = \frac{E(RI_A)}{P(A)} \quad \text{if } P(A) > 0 \tag{4.4.7} \]


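Let us check that (4.4.7) agrees with previous results when $R$ is discrete. By
(4.4.2),

\[ E(R \mid I_A = 1) = \sum_y y\,P\{R = y \mid I_A = 1\} = \sum_{y \ne 0} y\,P\{R = y \mid I_A = 1\} \]

But if $y \ne 0$,

\[ P\{R = y \mid I_A = 1\} = \frac{P\{R = y, I_A = 1\}}{P\{I_A = 1\}} = \frac{P\{RI_A = y\}}{P(A)} \]

Thus

\[ E(R \mid I_A = 1) = \frac{1}{P(A)}\sum_{y \ne 0} y\,P\{RI_A = y\} = \frac{E(RI_A)}{P(A)} \]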


† [Footnote to (4.4.6).] The reader may recognize this as the Riemann-Stieltjes integral $\int_{-\infty}^{\infty} g(x)\,dF(x)$.
Alternatively, if one differentiates $F$ formally to obtain $f = f_2$ plus "impulses" or "delta functions"
at $-1$ and 2 of strength 1/4 and 1/2, respectively, and then evaluates $\int_{-\infty}^{\infty} g(x)f(x)\,dx$, (4.4.6)
is obtained.
Let us look at another special case. For any random variable $R$ and event
$A$ with $P(A) > 0$, we may define the conditional distribution function of $R$
given $A$ in a natural way, namely,

\[ F_R(x_0 \mid A) = P\{R \le x_0 \mid A\} = \frac{P(A \cap \{R \le x_0\})}{P(A)} \tag{4.4.8} \]

Now assume that $R$ has density $f$ and $A$ is of the form $\{R \in B\}$ for some Borel
set $B$. Then

\[ P(A \cap \{R \le x_0\}) = P\{R \in B, R \le x_0\} = \int_{x \in B,\ x \le x_0} f(x)\,dx = \int_{-\infty}^{x_0} f(x)I_B(x)\,dx \]

Thus (4.4.8) becomes

\[ F_R(x_0 \mid A) = \int_{-\infty}^{x_0} \frac{f(x)}{P(A)}I_B(x)\,dx \]

In other words, there is a conditional density of $R$ given $A$, namely,

\[ f_R(x \mid A) = \frac{f(x)}{P(A)}I_B(x) = \begin{cases} f(x)/P(A) & \text{if } x \in B \\ 0 & \text{if } x \notin B \end{cases} \tag{4.4.9} \]

We may then compute the conditional expectation of $R$ given $A$:

\[ E(R \mid A) = \int_{-\infty}^{\infty} x f_R(x \mid A)\,dx = \int_{-\infty}^{\infty} \frac{x f(x)I_B(x)}{P(A)}\,dx
   = \frac{E(RI_B(R))}{P(A)} \quad \text{by Theorem 2 of Section 3.1} \]

But

\[ I_B(R) = I_{\{R \in B\}} \ \text{(by the discussion preceding Example 1)} = I_A \]

Thus

\[ E(R \mid A) = \frac{E(RI_A)}{P(A)} \]

in agreement with (4.4.7).
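
To make definition (4.4.7) concrete, here is a small exact computation on a finite sample space. The fair die, the event "the face is even", and the helper name `cond_expectation` are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction

# Sample space of a fair die: outcome -> probability (illustrative).
outcomes = {face: Fraction(1, 6) for face in range(1, 7)}

def cond_expectation(r, event):
    """E(R | A) = E(R * I_A) / P(A), as in (4.4.7); requires P(A) > 0."""
    p_event = sum(p for w, p in outcomes.items() if event(w))
    if p_event == 0:
        raise ValueError("E(R | A) is defined only when P(A) > 0")
    expectation_r_ia = sum(r(w) * p for w, p in outcomes.items() if event(w))
    return expectation_r_ia / p_event

def is_even(w):
    return w % 2 == 0

e_given_even = cond_expectation(lambda w: w, is_even)
e_given_odd = cond_expectation(lambda w: w, lambda w: not is_even(w))
# Weighted average of the two conditional expectations (each event has probability 1/2).
overall = Fraction(1, 2) * e_given_even + Fraction(1, 2) * e_given_odd
```

The weighted average `overall` reproduces the unconditional mean 7/2 of the die, which is the partition form of the theorem of total expectation.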


REMARK. (4.4.8) and (4.4.9) extend to $n$ dimensions. The conditional
distribution function of $(R_1, \ldots, R_n)$ given $A$ is
$F_{12\cdots n}(x_1, \ldots, x_n \mid A) = P\{R_1 \le x_1, \ldots, R_n \le x_n \mid A\}$. If $(R_1, \ldots, R_n)$ has
density $f$ and $A = \{(R_1, \ldots, R_n) \in B\}$, there is a conditional density of
$(R_1, \ldots, R_n)$ given $A$:

\[ f_{12\cdots n}(x_1, \ldots, x_n \mid A) = \frac{f(x_1, \ldots, x_n)}{P(A)}I_B(x_1, \ldots, x_n) \]

The argument is essentially the same as above. ◀


PROBLEMS 


1. Let $(R_1, R_2)$ have density $f(x, y) = 8xy$, $0 \le y \le x \le 1$; $f(x, y) = 0$ elsewhere.

(a) Find the conditional expectation of $R_2$ given $R_1 = x$, and the conditional
expectation of $R_1$ given $R_2 = y$.

(b) Find the conditional expectation of R,* given R, = «. 

(c) Find the conditional expectation of $R_2$ given $A = \{R_1 \le 1/2\}$.


. In Example 2 of Section 4.3, find the conditional expectation of Ry”, given 


R, = %,..., Rn = Xn. 


3. Let $(R_1, R_2)$ be uniformly distributed over the parallelogram with vertices
(0, 0), (2, 0), (3, 1), (1, 1). Find $E(R_2 \mid R_1 = x)$.

4. If a single die is tossed independently $n$ times, find the average number of 2's,
given that the number of 1's is $k$.

5. Let $R_1$ and $R_2$ be independent random variables, each uniformly distributed
between 0 and 2.
(a) Find the conditional probability that $R_1 \ge 1$, given that $R_1 + R_2 \le 3$.
(b) Find the conditional expectation of $R_1$, given that $R_1 + R_2 \le 3$.

6. Let $B_1, B_2, \ldots$ be mutually exclusive, exhaustive events, with $P(B_n) > 0$,
$n = 1, 2, \ldots$, and let $R$ be a random variable. Establish the following version
of the theorem of total expectation:

\[ E(R) = \sum_{n=1}^{\infty} P(B_n)E(R \mid B_n) \]

[if $E(R)$ exists].


7. Of the 100 people in a certain village, 50 always tell the truth, 30 always lie,
and 20 always refuse to answer. A single unbiased die is tossed. If the result is
1, 2, 3, or 4, a sample of size 30 is taken with replacement. If the result is 5 or 6,
a sample of size 30 is taken without replacement. A random variable $R$ is defined
as follows:

$R = 1$ if the resulting sample contains 10 people of each category.

$R = 2$ if the sample is taken with replacement and contains 12 liars.

$R = 3$ otherwise.

Find $E(R)$.
8. Let $R_1$ and $R_2$ be independent random variables, each uniformly distributed
between 0 and 1. Define

\[ R_3 = g(R_1, R_2) = R_1 \quad \text{if } R_1^2 + R_2^2 \le 1 \]
\[ R_3 = 2 \quad \text{if } R_1^2 + R_2^2 > 1 \]

(a) Find $F_3(z)$ and compute $E(R_3)$ from this.
(b) Compute $E(R_3)$ from $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)f_{12}(x, y)\,dx\,dy$.
(c) Compute $E(R_3 \mid R_1^2 + R_2^2 \le 1)$ and $E(R_3 \mid R_1^2 + R_2^2 > 1)$; then find
$E(R_3)$ by using the theorem of total expectation.

9. The density for the time $T$ required for the failure of a light bulb is $f(x) = \lambda e^{-\lambda x}$,
$x \ge 0$. Find the conditional density function of $T - t_0$, given that
$T > t_0$, and interpret the result intuitively.


10. Let R, and R, be independent random variables, each uniformly distributed 
between 0 and 1. Find the conditional expectation of (R, + R,)" given R, — Re. 


11. Let $R_1$ and $R_2$ be independent random variables, each with density $f(x) = (1/2)e^{-x}$,
$x \ge 0$; $f(x) = 1/2$, $-1 \le x < 0$; $f(x) = 0$, $x < -1$. Let $R_3 = R_1^2 + R_2^2$.
Find $E(R_3 \mid R_1 = x)$.

12. Let $R_1$ be a discrete random variable; if $R_1 = x$, let $R_2$ have a conditional
density $h(y \mid x)$. Define the conditional probability that $R_1 = x$ given that
$R_2 = y$ as

\[ P\{R_1 = x \mid R_2 = y\} = \frac{P\{R_1 = x\}h(y \mid x)}{\sum_{x'} P\{R_1 = x'\}h(y \mid x')} \]

(cf. Bayes' Theorem).
(a) Interpret this definition intuitively by considering $P\{R_1 = x \mid y < R_2 < y + dy\}$.
(b) Show that the definition is natural in the sense that the appropriate version
of the theorem of total probability is satisfied:

\[ P\{R_1 \in A, R_2 \in B\} = \int_B f_2(y)P\{R_1 \in A \mid R_2 = y\}\,dy \]

where

\[ P\{R_1 \in A \mid R_2 = y\} = \sum_{x \in A} P\{R_1 = x \mid R_2 = y\} \]
\[ f_2(y) = \sum_x P\{R_1 = x\}h(y \mid x) \]

[see (4.4.3)].


13. If $R_1$ is absolutely continuous and $R_2$ discrete, and $p(y \mid x) = P\{R_2 = y \mid R_1 = x\}$
is specified, show that there is a conditional density of $R_1$ given $R_2$, namely,

\[ h(x \mid y) = \frac{f_1(x)p(y \mid x)}{p_2(y)} \]

where

\[ p_2(y) = P\{R_2 = y\} = \int_{-\infty}^{\infty} f_1(x)p(y \mid x)\,dx \]
14. Let $R$ be uniformly distributed between 0 and 1. If $R = \lambda$, a coin with probability
of heads $\lambda$ is tossed independently $n$ times. If $R_1, \ldots, R_n$ are the results of the
tosses ($R_i = 1$ for a head, $R_i = 0$ for a tail), find the conditional density of $R$
given $(R_1, \ldots, R_n)$, and the conditional expectation of $R$ given $(R_1, \ldots, R_n)$.

15. (Hypothesis testing) Consider the following experiment. Throw a coin with
probability $p$ of heads. If the coin comes up heads, observe a random variable
$R$ with density $f_0(x)$; if the coin comes up tails, let $R$ have density $f_1(x)$. Suppose
that we are not told the result of the coin toss, but only the value of $R$, and
we have to guess whether or not the coin came up heads. We do this by means
of a decision scheme, which is simply a Borel set $S$ of real numbers with the
interpretation that if $R = x$ and $x \in S$, we decide for tails, that is, $f_1$, and if
$x \notin S$ we decide for heads, that is, $f_0$.
(a) Find the over-all probability of error in terms of $p$, $f_0$, $f_1$, and $S$. [There
are two types of errors: if the actual density is $f_0$ and we decide for $f_1$ (type
1 error), and if the actual density is $f_1$ and we decide for $f_0$ (type 2 error).]
(b) For a given $p$, $f_0$, $f_1$, find the $S$ that makes the over-all probability of error
a minimum. Apply the results to the case in which $f_i$ is the normal density
with mean $m_i$ and variance $\sigma^2$, $i = 0, 1$.


REMARK. A physical model for part (b) is the following. The input $R$ to a radar
receiver is of the form $\theta + N$, where $\theta$ (the signal) and $N$ (the noise) are
independent random variables, with $P\{\theta = m_0\} = p$, $P\{\theta = m_1\} = 1 - p$,
and $N$ normally distributed with mean 0 and variance $\sigma^2$. If $\theta = m_i$ ($i = 0$
corresponds to a head in the above discussion, and $i = 1$ to a tail), then $R$
is normal with mean $m_i$ and variance $\sigma^2$; thus $f_i$ is the conditional density
of $R$ given $\theta = m_i$. We are trying to determine the actual value of the signal
with as low a probability of error as possible.

16. Let $R$ be the number of successes in $n$ Bernoulli trials, with probability $p$ of
success on a given trial. Find the conditional expectation of $R$, given that
$R \ge 2$.

17. Let $R_1$ be uniformly distributed between 0 and 10, and define $R_2$ by

\[ R_2 = R_1^2 \quad \text{if } 0 \le R_1 \le 6 \]
\[ \phantom{R_2} = 3 \quad \text{if } 6 < R_1 \le 10 \]

Find the conditional expectation of $R_2$ given that $2 \le R_2 \le 4$.

18. Consider the following two-stage random experiment.
(i) A circle of radius $R$ and center at (0, 0) is selected, where $R$ has density
$f_R(z) = e^{-z}$, $z \ge 0$; $f_R(z) = 0$, $z < 0$.
(ii) A point $(R_1, R_2)$ is chosen, where $(R_1, R_2)$ is uniformly distributed inside
the circle selected in step (i).

(a) If $D = (R_1^2 + R_2^2)^{1/2}$ is the distance of the resulting point from the origin,
find $E(D)$.

(b) Find the conditional density of $R$ given $R_1 = x$, $R_2 = y$. (Leave the answer
in the form of an integral.)
19. (An estimation problem) The input $R$ to a radar receiver is of the form $\theta + N$,
where $\theta$ (the signal) and $N$ (the noise) are independent random variables with
finite mean and variance. The value of $R$ is observed, and then an estimate of
$\theta$ is made, say, $\theta^* = d(R)$, where $d$ is a function from the reals to the reals.
We wish to choose the estimate so that $E[(\theta^* - \theta)^2]$ is as small as possible.
(a) Show that $d(x)$ is the conditional expectation $E(\theta \mid R = x)$. (Assume that
$R$ is either absolutely continuous or discrete.)
(b) Let $\theta = \pm 1$ with equal probability, and let $N$ be uniformly distributed
between $-2$ and $+2$. Find $d(x)$ and the minimum value of $E[(\theta^* - \theta)^2]$.

20. A number $\theta$ is chosen at random with density $f_\theta(\lambda) = e^{-\lambda}$, $\lambda \ge 0$; $f_\theta(\lambda) = 0$,
$\lambda < 0$. If $\theta$ takes the value $\lambda$, a random variable $R$ is observed, where $R$ has
the Poisson distribution with parameter $\lambda$. For example, $R$ might be the number
of radioactive particles (or particles with some other distinguishing characteristic)
passing through a counting device in a given time interval, where the
average number of such particles is selected randomly. The value of $R$ is
observed and an estimate of $\theta$ is made, say $\theta^* = d(R)$. The argument of
Problem 19, which applies in any situation when one makes an estimate
$\theta^* = d(R)$ of a parameter $\theta$, and when the distribution function of $R$ depends
on $\theta$, shows that the estimate that minimizes $E[(\theta^* - \theta)^2]$ is $d(x) = E(\theta \mid R = x)$.
Find $d(x)$ in this case.


REMARK. Problems 15, 19, and 20 illustrate some techniques of statistics. This
subject will be taken up systematically in Chapter 8.


4.5 APPENDIX: THE GENERAL CONCEPT 
OF CONDITIONAL EXPECTATION 


By shifting our viewpoint slightly, we may regard a conditional expectation
as a random variable defined on the given probability space. For example,
suppose that $E(R_2 \mid R_1 = x) = x^2$. We may then say that, having observed
$R_1$, the average value of $R_2$ is $R_1^2$. We adopt the notation $E(R_2 \mid R_1) = R_1^2$.
In general, if $E(R_2 \mid R_1 = x) = g(x)$, we define $E(R_2 \mid R_1) = g(R_1)$. Then
$E(R_2 \mid R_1)$ is a function defined on $\Omega$; its value at the point $\omega$ is $g(R_1(\omega))$.

Let us see what happens to the theorem of total expectation in this notation.
If, for example,

\[ E(R_2) = \int_{-\infty}^{\infty} f_1(x)E(R_2 \mid R_1 = x)\,dx = \int_{-\infty}^{\infty} f_1(x)g(x)\,dx \]

then $E(R_2) = E[g(R_1)]$; in other words,

\[ E(R_2) = E[E(R_2 \mid R_1)] \tag{4.5.1} \]

The expectation of the conditional expectation of $R_2$ given $R_1$ is the
expectation of $R_2$.
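
The passage from $E(R_2 \mid R_1 = x) = g(x)$ to the random variable $g(R_1)$ can be illustrated on a small discrete example. The joint probability function below is hypothetical, chosen only to make the arithmetic easy to follow; `g` mirrors the notation of the text.

```python
from fractions import Fraction
from collections import defaultdict

# Hypothetical joint probability function p12(x, y) = P{R1 = x, R2 = y}.
joint = {(0, 1): Fraction(1, 8), (0, 3): Fraction(3, 8),
         (1, 1): Fraction(1, 4), (1, 5): Fraction(1, 4)}

# Marginal p1(x) = P{R1 = x}.
p1 = defaultdict(Fraction)
for (x, _), p in joint.items():
    p1[x] += p

def g(x):
    """g(x) = E(R2 | R1 = x) = sum over y of y * p12(x, y) / p1(x)."""
    return sum(y * p for (x0, y), p in joint.items() if x0 == x) / p1[x]

# Left side of (4.5.1): E(R2) computed directly from the joint probabilities.
e_r2 = sum(y * p for (_, y), p in joint.items())
# Right side of (4.5.1): E[g(R1)], the expectation of the random variable E(R2 | R1).
e_of_cond = sum(g(x) * p for x, p in p1.items())
```

Here $g(0) = 5/2$ and $g(1) = 3$, and both sides of (4.5.1) equal 11/4.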
Let us develop this a bit further. Let $A$ be a Borel subset of $E^1$. Then,
assuming that (4.5.1) holds for the random variable $R_2 I_{\{R_1 \in A\}}$, we have

\[ E(R_2 I_{\{R_1 \in A\}}) = E[E(R_2 I_{\{R_1 \in A\}} \mid R_1)] \]

But having observed $R_1$, $R_2 I_{\{R_1 \in A\}}$ will be $R_2$ if $R_1 \in A$, and 0 otherwise;
thus we expect intuitively that

\[ E(R_2 I_{\{R_1 \in A\}} \mid R_1) = I_{\{R_1 \in A\}}E(R_2 \mid R_1) \]

It appears reasonable to expect, then, that

\[ E(R_2 I_{\{R_1 \in A\}}) = E[I_{\{R_1 \in A\}}E(R_2 \mid R_1)] \quad \text{for all Borel subsets } A \text{ of } E^1 \tag{4.5.2} \]


It turns out that if $R_1$ is an arbitrary random variable and $R_2$ a random
variable whose expectation exists, there is a random variable $R$ of the form
$g(R_1)$ for some Borel measurable function $g$, such that

\[ E(R_2 I_{\{R_1 \in A\}}) = E(I_{\{R_1 \in A\}}R) \quad \text{for all Borel subsets } A \text{ of } E^1 \]

We set $R = E(R_2 \mid R_1)$. Furthermore, $R$ is essentially unique: if $R' = g'(R_1)$
for some Borel measurable function $g'$, and $R'$ also satisfies (4.5.2), then
$R = R'$ except perhaps on a set of probability 0.

In the cases considered in this chapter, the conditional expectations all 
satisfy (4.5.2) (which is just a restatement of the theorem of total expecta- 
tion), and thus the examples of the chapter are consistent with the general 
notion of conditional expectation. 


Characteristic Functions 


5.1 INTRODUCTION 


In Chapter 2 we examined the problem of finding probabilities of the form
$P\{(R_1, \ldots, R_n) \in B\}$, where $R_1, \ldots, R_n$ were random variables on a given
probability space. If $(R_1, \ldots, R_n)$ has density $f$, then

\[ P\{(R_1, \ldots, R_n) \in B\} = \int\cdots\int_B f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n \]
B 


In general, the evaluation of integrals of this type is quite difficult, if it is 
possible at all. In this chapter we describe an approach to a particular class 
of problems, those involving sums of independent random variables, which 
avoids integration in n dimensions. The approach is similar in spirit to the 
application of Fourier or Laplace transforms to a differential equation. 

Let $R$ be a random variable on a given probability space. We introduce the
characteristic function of $R$, defined by

\[ M_R(u) = E(e^{-iuR}), \qquad u \text{ real} \tag{5.1.1} \]


Here we meet complex-valued random variables for the first time. A
complex-valued random variable on $(\Omega, \mathcal{F}, P)$ is a function $T$ from $\Omega$ to the
complex numbers $C$, such that the real part $T_1$ and the imaginary part $T_2$
of $T$ are (real-valued) random variables. Thus $T(\omega) = T_1(\omega) + iT_2(\omega)$,
$\omega \in \Omega$. We define the expectation of $T$ as the complex number $E(T) = E(T_1) + iE(T_2)$;
$E(T)$ is defined only if $E(T_1)$ and $E(T_2)$ are both finite. In
the present case we have $M_R(u) = E(\cos uR) - iE(\sin uR)$; since the cosine
and the sine are $\le 1$ in absolute value, all expectations are finite. Thus $M_R$
is a function from the reals to the complex numbers. If $R$ has density $f_R$ we
obtain

\[ M_R(u) = \int_{-\infty}^{\infty} e^{-iux}f_R(x)\,dx \tag{5.1.2} \]

which is the Fourier transform of $f_R$.
It will be convenient in many computations to use a Laplace rather than a
Fourier transform. The generalized characteristic function of $R$ is defined by

\[ N_R(s) = E(e^{-sR}), \qquad s \text{ complex}\,‡ \tag{5.1.3} \]

$N_R(s)$ is defined only for those $s$ such that $E(e^{-sR})$ is finite. If $s$ is imaginary,
that is, if $s = iu$, $u$ real, then $N_R(s) = M_R(u)$, so that $N_R(s)$ is defined at
least for $s$ on the imaginary axis. There will be situations in which $N_R(s)$
is not defined for any $s$ off the imaginary axis, and other situations in which
$N_R(s)$ is defined for all $s$.

If $R$ has density $f_R$, we obtain

\[ N_R(s) = \int_{-\infty}^{\infty} e^{-sx}f_R(x)\,dx \tag{5.1.4} \]

This is the (two-sided) Laplace transform of $f_R$.
The basic fact about characteristic functions is the following. 


Theorem 1. Let $R_1, \ldots, R_n$ be independent random variables on a given
probability space, and let $R_0 = R_1 + \cdots + R_n$. If $N_{R_i}(s)$ is finite for all
$i = 1, 2, \ldots, n$, then $N_{R_0}(s)$ is finite, and

\[ N_{R_0}(s) = N_{R_1}(s)N_{R_2}(s) \cdots N_{R_n}(s) \]

In particular, if we set $s = iu$, we obtain

\[ M_{R_0}(u) = M_{R_1}(u)M_{R_2}(u) \cdots M_{R_n}(u) \]

Thus the characteristic function of a sum of independent random variables
is the product of the characteristic functions.
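
Theorem 1 can be checked numerically for discrete random variables, where the expectation in (5.1.1) is a finite sum. The sketch below uses a fair die and a biased coin as two independent variables; these distributions, the helper `char_fn`, and the test value of $u$ are illustrative assumptions.

```python
import cmath
from itertools import product

def char_fn(pmf, u):
    """M_R(u) = E(exp(-i u R)) for a discrete R, with the sign convention of (5.1.1)."""
    return sum(p * cmath.exp(-1j * u * x) for x, p in pmf.items())

die = {face: 1 / 6 for face in range(1, 7)}
coin = {0: 0.3, 1: 0.7}  # illustrative probabilities

# Probability function of the sum of the two independent variables.
total = {}
for (x, px), (y, py) in product(die.items(), coin.items()):
    total[x + y] = total.get(x + y, 0.0) + px * py

u = 0.8  # arbitrary real test point
lhs = char_fn(total, u)                    # characteristic function of the sum
rhs = char_fn(die, u) * char_fn(coin, u)   # product of the individual ones
```

The two values agree up to floating-point rounding, for this and any other real $u$.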


‡ In doing most of the examples in this chapter, the student will not come to grief if he
regards $s$ as a real variable and replaces statements such as "$a < \mathrm{Re}\,s < b$" by "$a < s < b$."
Also, a comment about notation. We have taken $E(e^{-iuR})$ as the definition of the
characteristic function rather than the more usual $E(e^{iuR})$ in order to preserve a notational
symmetry between Fourier and Laplace transforms ($\int e^{-sx}f(x)\,dx$, not $\int e^{sx}f(x)\,dx$, is the
standard notation for Laplace transform). Since $u$ ranges over all real numbers, this change
is of no essential significance.
PROOF.

\[ E(e^{-sR_0}) = E(e^{-s(R_1 + \cdots + R_n)}) = E\Big[\prod_{k=1}^{n} e^{-sR_k}\Big] = \prod_{k=1}^{n} E(e^{-sR_k}) \]

by independence.


We have glossed over one point in this argument. If we take $n = 2$ for
simplicity, we have complex-valued random variables $V = V_1 + iV_2$ and
$W = W_1 + iW_2$ ($V = e^{-sR_1}$, $W = e^{-sR_2}$), where, by Theorem 2 of Section
2.7, $V_j$ and $W_k$ are independent ($j, k = 1, 2$), and all expectations are finite.
We must show that $E(VW) = E(V)E(W)$, which we have proved only in the
case when $V$ and $W$ are real-valued and independent. However, there is no
difficulty.

\[ E(VW) = E[V_1W_1 - V_2W_2 + i(V_1W_2 + V_2W_1)] \]
\[ \phantom{E(VW)} = E(V_1)E(W_1) - E(V_2)E(W_2) + i\big(E(V_1)E(W_2) + E(V_2)E(W_1)\big) \]
\[ \phantom{E(VW)} = [E(V_1) + iE(V_2)][E(W_1) + iE(W_2)] = E(V)E(W) \]

The proof for arbitrary $n$ is more cumbersome, but the idea is exactly the
same.

Thus we may find the characteristic function of a sum of independent 
random variables without any n-dimensional integration. However, this 
technique will not be of value unless it is possible to recover the distribution 
function from the characteristic function. In fact we have the following 
result, which we shall not prove. 


Theorem 2 (Correspondence Theorem). If $M_{R_1}(u) = M_{R_2}(u)$ for all $u$, then
$F_{R_1}(x) = F_{R_2}(x)$ for all $x$.

For computational purposes we need some facts about the Laplace transform.
Let $f$ be a piecewise continuous function from $E^1$ to $E^1$ (not necessarily
a density) and $L_f$ its Laplace transform:

\[ L_f(s) = \int_{-\infty}^{\infty} f(x)e^{-sx}\,dx \]


Laplace Transform Properties 

1. If there are real numbers $K_1$ and $K_2$ and nonnegative real numbers $A_1$
and $A_2$ such that $|f(x)| \le A_1e^{K_1x}$ for $x \ge 0$, and $|f(x)| \le A_2e^{K_2x}$ for $x \le 0$,
then $L_f(s)$ is finite for $K_1 < \mathrm{Re}\,s < K_2$. This follows, since

\[ \int_0^{\infty} |f(x)e^{-sx}|\,dx \le \int_0^{\infty} A_1e^{(K_1 - a)x}\,dx \]

and

\[ \int_{-\infty}^{0} |f(x)e^{-sx}|\,dx \le \int_{-\infty}^{0} A_2e^{(K_2 - a)x}\,dx \]

where $a = \mathrm{Re}\,s$. The integrals are finite if $K_1 < a < K_2$. Thus the class of
functions whose Laplace transform can be taken is quite large.

2. If $g(x) = f(x - a)$ and $L_f$ is finite at $s$, then $L_g$ is also finite at $s$ and
$L_g(s) = e^{-as}L_f(s)$. This follows, since

\[ \int_{-\infty}^{\infty} f(x - a)e^{-sx}\,dx = e^{-as}\int_{-\infty}^{\infty} f(x - a)e^{-s(x - a)}\,d(x - a) \]
3. If $h(x) = f(-x)$ and $L_f$ is finite at $s$, then $L_h$ is finite at $-s$ and
$L_h(-s) = L_f(s)$ [or $L_h(s) = L_f(-s)$ if $L_f$ is finite at $-s$]. To verify this, write

\[ \int_{-\infty}^{\infty} h(x)e^{sx}\,dx = (\text{with } y = -x) \int_{-\infty}^{\infty} f(y)e^{-sy}\,dy \]


4. If $g(x) = e^{-ax}f(x)$ and $L_f$ is finite at $s$, then $L_g$ is finite at $s - a$ and
$L_g(s - a) = L_f(s)$ [or $L_g(s) = L_f(s + a)$ if $L_f$ is finite at $s + a$]. For

\[ \int_{-\infty}^{\infty} e^{-(s - a)x}e^{-ax}f(x)\,dx = \int_{-\infty}^{\infty} e^{-sx}f(x)\,dx \]


We now construct a very brief table of Laplace transforms for use in the
examples. In Table 5.1.1, $u(x)$ is the unit step function, defined by $u(x) = 1$,
$x \ge 0$; $u(x) = 0$, $x < 0$.

Table 5.1.1. Laplace Transforms

f(x)                                      L_f(s)                                      Region of Convergence
$u(x)$                                    $1/s$                                       $\mathrm{Re}\,s > 0$
$e^{-ax}u(x)$                             $1/(s + a)$                                 $\mathrm{Re}\,s > -a$
$x^n e^{-ax}u(x)$, $n = 0, 1, \ldots$     $n!/(s + a)^{n+1}$                          $\mathrm{Re}\,s > -a$
$x^{\alpha}e^{-ax}u(x)$, $\alpha > -1$    $\Gamma(\alpha + 1)/(s + a)^{\alpha+1}$     $\mathrm{Re}\,s > -a$

If we verify the last entry in the table the others will follow. Now

\[ \int_0^{\infty} x^{\alpha}e^{-ax}e^{-sx}\,dx = (\text{with } y = (s + a)x) \int_0^{\infty} \frac{y^{\alpha}e^{-y}}{(s + a)^{\alpha+1}}\,dy = \frac{\Gamma(\alpha + 1)}{(s + a)^{\alpha+1}}\,‡ \]

‡ Strictly speaking, these manipulations are only valid for $s$ real and $> -a$. However, one
can show that under the hypothesis of property 1, $L_f$ is analytic for $K_1 < \mathrm{Re}\,s < K_2$. In the
present case $K_1 = -a$ and $K_2$ can be taken arbitrarily large, so that $L_f$ is analytic for
$\mathrm{Re}\,s > -a$. Now $L_f(s) = \Gamma(\alpha + 1)/(s + a)^{\alpha+1}$ for $s$ real and $> -a$, and therefore, by the
identity theorem for analytic functions, the formula holds for all $s$ with $\mathrm{Re}\,s > -a$. This
technique, which allows one to treat certain complex integrals as if the integrands were
real-valued, will be used several times without further comment.
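
For real $s > -a$, the last entry of Table 5.1.1 can be checked against a direct numerical integration. The sketch below truncates the integral at a finite upper limit and applies a midpoint rule; the helper name `laplace_numeric`, the truncation point, the step count, and the parameter values are all illustrative assumptions.

```python
import math

def laplace_numeric(f, s, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of f(x) * exp(-s x)
    over (0, upper), for real s; the tail beyond `upper` is neglected."""
    width = upper / steps
    return width * sum(
        f((j + 0.5) * width) * math.exp(-s * (j + 0.5) * width)
        for j in range(steps)
    )

alpha, a, s = 2.5, 1.0, 0.5  # alpha > -1 and s > -a, as the table requires
numeric = laplace_numeric(lambda x: x ** alpha * math.exp(-a * x), s)
closed_form = math.gamma(alpha + 1) / (s + a) ** (alpha + 1)
```

The numerical value and the table entry $\Gamma(\alpha + 1)/(s + a)^{\alpha+1}$ should agree to several decimal places; the complex case is covered by the analyticity argument in the footnote, not by this real-variable check.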
REMARK. $u(x)$ and $-u(-x)$ have the same Laplace transform $1/s$, but the
regions of convergence are disjoint:

\[ \int_{-\infty}^{\infty} u(x)e^{-sx}\,dx = \int_0^{\infty} e^{-sx}\,dx = \frac{1}{s}, \qquad \mathrm{Re}\,s > 0 \]

\[ \int_{-\infty}^{\infty} -u(-x)e^{-sx}\,dx = \int_{-\infty}^{0} -e^{-sx}\,dx = \frac{1}{s}, \qquad \mathrm{Re}\,s < 0 \]

This indicates that any statement about Laplace transforms should be
accompanied by some information about the region of convergence.


We need the following result in doing examples; the proof is measure- 
theoretic and will be omitted. 

5. Let $R$ be an absolutely continuous random variable. If $h$ is a nonnegative
(piecewise continuous) function and $L_h(s)$ is finite and coincides with the
generalized characteristic function $N_R(s)$ for all $s$ on the line $\mathrm{Re}\,s = a$, then
$h$ is the density of $R$.


5.2 EXAMPLES 


We are going to examine some typical problems involving sums of independent
random variables. We shall use the result, to be justified in Example 6,
that if $R_1, R_2, \ldots, R_n$ are independent, each absolutely continuous, then
$R_1 + \cdots + R_n$ is also absolutely continuous.

In all examples $N_i(s)$ will denote the generalized characteristic function of
the random variable $R_i$.


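▶ Example 1. Let $R_1$ and $R_2$ be independent random variables, with $R_1$
uniformly distributed between $-1$ and $+1$, and $R_2$ having the exponential
density $e^{-y}u(y)$. Find the density of $R_0 = R_1 + R_2$.

We have

\[ N_1(s) = \int_{-1}^{1} \tfrac{1}{2}e^{-sx}\,dx = \frac{1}{2s}(e^{s} - e^{-s}), \qquad \text{all } s \]

\[ N_2(s) = \int_0^{\infty} e^{-sy}e^{-y}\,dy = \frac{1}{s + 1}, \qquad \mathrm{Re}\,s > -1 \]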
als) { . s+1 eo 
Thus, by Theorem 1 of Section 5.1,

\[ N_0(s) = N_1(s)N_2(s) = \frac{1}{2s(s + 1)}(e^{s} - e^{-s}) \]

at least for $\mathrm{Re}\,s > -1$.

[FIGURE 5.2.1 Sketch of the density $f_0(x)$: $\tfrac{1}{2}(1 - e^{-(x+1)})$ for $-1 \le x \le 1$, and $\tfrac{1}{2}(e^{-(x-1)} - e^{-(x+1)})$ for $x \ge 1$; x-axis ticks at $-1$, 0, 1.]

To find a function with this Laplace transform, we
use partial fraction expansion of the rational function part of $N_0(s)$:

\[ \frac{1}{2s(s + 1)} = \frac{1}{2s} - \frac{1}{2(s + 1)} \]

Now from Table 5.1.1, u(x) has transform 1/s (Res > 0) and e-“u(x) 
has transform 1/(s+1) (Res >-—1). Thus (1/2)(1 — e~*) u(x) has 
transform 1/2s(s + 1) (Res > 0). By property 2 of Laplace transforms 
(Section 5.1), (1/2)(1 — e~”)u(z% + 1) has transform e*/2s(s + 1) and 
(1/2)(1 — e—*)u(x — 1) has transform e~'/2s(s + 1) (Re s > 0). Thus a 
function 4 whose transform is No(s) for Re s > 0 is 


h(x) = 4(1 — e—@ )u(w + :1) — £0 — ec )u(e — 1) 


By property 5 of Laplace transforms, / is the density of Ry; for a sketch, see 
Figure 5.2.1. < 
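The result of Example 1 can be cross-checked numerically. The sketch below compares the closed form for h with a direct midpoint-rule evaluation of the integral of $f_1(t)f_2(x - t)$ over t (the convolution formula of Example 6). The function names `h` and `conv` and the choice of quadrature rule are illustrative assumptions, not part of the text.

```python
import math

def h(x):
    """Closed-form density of R0 = R1 + R2 obtained in Example 1."""
    total = 0.0
    if x >= -1.0:                      # contribution of the term with u(x + 1)
        total += 0.5 * (1.0 - math.exp(-(x + 1.0)))
    if x >= 1.0:                       # contribution of the term with u(x - 1)
        total -= 0.5 * (1.0 - math.exp(-(x - 1.0)))
    return total

def conv(x, steps=2000):
    """Midpoint-rule value of the integral of f1(t) * f2(x - t) dt.

    f1 is 1/2 on (-1, 1); f2(y) = exp(-y) for y >= 0, so the integrand
    is nonzero only for -1 <= t <= min(x, 1).
    """
    upper = min(x, 1.0)
    if upper <= -1.0:
        return 0.0
    dt = (upper + 1.0) / steps
    total = 0.0
    for k in range(steps):
        t = -1.0 + (k + 0.5) * dt
        total += 0.5 * math.exp(-(x - t)) * dt
    return total

if __name__ == "__main__":
    for x in (-0.5, 0.0, 0.7, 1.0, 3.0):
        print(x, h(x), conv(x))
```

The two columns printed should agree to several decimal places if the transform calculation is correct.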


▶ Example 2. Let $R_0 = R_1 + R_2 + R_3$, where $R_1$, $R_2$, and $R_3$ are independent
with densities $f_1(x) = f_2(x) = e^{x}u(-x)$, $f_3(x) = e^{-(x-1)}u(x-1)$. Find
the density of $R_0$.

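We have

$$N_1(s) = N_2(s) = \int_{-\infty}^{0} e^{x}e^{-sx}\,dx = \frac{1}{1-s}, \qquad \operatorname{Re} s < 1$$

and

$$N_3(s) = \int_{1}^{\infty} e^{-(x-1)}e^{-sx}\,dx = \frac{e^{-s}}{s+1}, \qquad \operatorname{Re} s > -1$$

Thus

$$N_0(s) = N_1(s)N_2(s)N_3(s) = \frac{e^{-s}}{(s-1)^2(s+1)}, \qquad -1 < \operatorname{Re} s < 1$$

We expand the rational function in partial fractions.

$$G(s) = \frac{1}{(s-1)^2(s+1)} = \frac{A}{(s-1)^2} + \frac{B}{s-1} + \frac{C}{s+1}$$

The coefficients may be found as follows.

$$A = \big[(s-1)^2G(s)\big]_{s=1} = \tfrac{1}{2}$$

$$B = \Big[\frac{d}{ds}\big((s-1)^2G(s)\big)\Big]_{s=1} = -\tfrac{1}{4}$$

$$C = \big[(s+1)G(s)\big]_{s=-1} = \tfrac{1}{4}$$

From Table 5.1.1, the transform of $xe^{-x}u(x)$ is $1/(s+1)^2$, Re s > −1.
By Laplace transform property 3, the transform of $-xe^{x}u(-x)$ is $1/(1-s)^2$,
Re s < 1. The transform of $e^{-x}u(x)$ is $1/(s+1)$, Re s > −1, so that, again
by property 3, the transform of $e^{x}u(-x)$ is $1/(1-s)$, Re s < 1.
Thus the transform of

$$-\tfrac{1}{2}\,xe^{x}u(-x) + \tfrac{1}{4}\,e^{x}u(-x) + \tfrac{1}{4}\,e^{-x}u(x)$$

is

$$\frac{1/2}{(s-1)^2} - \frac{1/4}{s-1} + \frac{1/4}{s+1} = G(s), \qquad -1 < \operatorname{Re} s < 1$$

By property 2, the transform of

$$h(x) = \big[\tfrac{1}{4} - \tfrac{1}{2}(x-1)\big]\,e^{x-1}\,u(-(x-1)) + \tfrac{1}{4}\,e^{-(x-1)}\,u(x-1)$$

is

$$e^{-s}G(s) = N_0(s), \qquad -1 < \operatorname{Re} s < 1$$

By property 5, h is the density of $R_0$ (see Figure 5.2.2). ◀

[FIGURE 5.2.2: sketch of h(x), with the horizontal axis marked at 0 and 1; $h(x) = \big(\tfrac{1}{4} + \tfrac{1}{2}(1-x)\big)e^{x-1}$ for x < 1 and $h(x) = \tfrac{1}{4}e^{-(x-1)}$ for x > 1.]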
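The partial fraction coefficients of Example 2 are easy to get wrong by a sign, so a quick exact check is worthwhile. The sketch below evaluates $G(s) = 1/((s-1)^2(s+1))$ and the expansion with A = 1/2, B = −1/4, C = 1/4 at a few rational points using exact arithmetic. The helper names `G` and `expansion` are assumptions made for this illustration.

```python
from fractions import Fraction

# Coefficients claimed in Example 2.
A, B, C = Fraction(1, 2), Fraction(-1, 4), Fraction(1, 4)

def G(s):
    """Rational part of N0(s): 1 / ((s - 1)^2 (s + 1))."""
    return 1 / ((s - 1) ** 2 * (s + 1))

def expansion(s):
    """Partial fraction form A/(s-1)^2 + B/(s-1) + C/(s+1)."""
    return A / (s - 1) ** 2 + B / (s - 1) + C / (s + 1)

if __name__ == "__main__":
    # Any points other than the poles s = 1 and s = -1 will do.
    for s in (Fraction(0), Fraction(1, 2), Fraction(3), Fraction(-5, 2)):
        print(s, G(s), expansion(s))
```

Because two rational functions of this degree that agree at four or more points agree identically, matching values at the sample points confirm the expansion.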
p> Example 3. Let R have the Cauchy density; that is, 
$$f_R(x) = \frac{1}{\pi(1 + x^2)}, \qquad -\infty < x < \infty$$

The characteristic function of R is

$$M_R(u) = \int_{-\infty}^{\infty} e^{-iux}f_R(x)\,dx$$

[In this case $N_R(s)$ is finite only for s on the imaginary axis.] $M_R(u)$ turns
out to be $e^{-|u|}$. This may be verified by complex variable methods (see
Problem 9), but instead we give a rough sketch of another attack. If the
characteristic function of a random variable R is integrable, that is,

$$\int_{-\infty}^{\infty} |M_R(u)|\,du < \infty$$

it turns out that R has a density and in fact $f_R$ is given by the inverse Fourier
transform.

$$f_R(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} M_R(u)\,e^{iux}\,du \qquad (5.2.1)$$


In the present case

$$\int_{-\infty}^{\infty} e^{-|u|}\,du = \int_{-\infty}^{0} e^{u}\,du + \int_{0}^{\infty} e^{-u}\,du = 2 < \infty$$

and thus the density corresponding to $e^{-|u|}$ is

$$\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-|u|}e^{iux}\,du = \frac{1}{2\pi}\int_{-\infty}^{0} e^{u(1+ix)}\,du + \frac{1}{2\pi}\int_{0}^{\infty} e^{-u(1-ix)}\,du$$

$$= \frac{1}{2\pi}\left[\frac{1}{1+ix} + \frac{1}{1-ix}\right] = \frac{1}{\pi(1 + x^2)}$$

Thus the Cauchy density in fact corresponds to the characteristic function
$e^{-|u|}$.

This argument has a serious gap. We started with the assumption that
$e^{-|u|}$ was the characteristic function of some random variable, and deduced
from this that the random variable must have density $1/\pi(1 + x^2)$. We must
establish that $e^{-|u|}$ is in fact a characteristic function (see Problem 8).
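The inversion integral (5.2.1) for $M(u) = e^{-|u|}$ can also be evaluated numerically. Since the imaginary part cancels by symmetry, the integral reduces to $(1/\pi)\int_0^\infty e^{-u}\cos(ux)\,du$. The sketch below uses a trapezoidal rule on a truncated range; the truncation point, step count, and function names are assumptions chosen for illustration.

```python
import math

def inverse_transform(x, upper=40.0, steps=40000):
    """Trapezoidal estimate of (1/pi) * integral_0^upper exp(-u) cos(u x) du.

    This equals (1/2pi) * integral of exp(-|u|) exp(iux) du up to a
    truncation error of order exp(-upper).
    """
    du = upper / steps
    total = 0.5 * (1.0 + math.exp(-upper) * math.cos(upper * x))
    for k in range(1, steps):
        u = k * du
        total += math.exp(-u) * math.cos(u * x)
    return total * du / math.pi

def cauchy_density(x):
    """The Cauchy density 1 / (pi (1 + x^2))."""
    return 1.0 / (math.pi * (1.0 + x * x))

if __name__ == "__main__":
    for x in (0.0, 0.5, 2.0, -3.0):
        print(x, inverse_transform(x), cauchy_density(x))
```

Agreement of the two columns illustrates the calculation above; it does not close the gap just described, which concerns whether $e^{-|u|}$ is a characteristic function in the first place.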

Now let $R_0 = R_1 + \cdots + R_n$, where the $R_i$ are independent, each with
the Cauchy density. Let us find the density of $R_0$. We have

$$M_0(u) = M_1(u)M_2(u)\cdots M_n(u) = (e^{-|u|})^n = e^{-n|u|}$$

If instead we consider $R_0/n$, we obtain

$$M_{R_0/n}(u) = E[e^{-iuR_0/n}] = M_0\!\left(\frac{u}{n}\right) = e^{-|u|}$$

Thus $R_0/n$ has the Cauchy density. Now if $R_2 = nR_1$, then $f_2(y) =
(1/n)f_1(y/n)$ (see Section 2.4), and so the density of $R_0$ is

$$f_0(y) = \frac{1}{n\pi(1 + y^2/n^2)} = \frac{n}{\pi(y^2 + n^2)}$$
REMARKS.

1. The arithmetic average $R_0/n$ of a sequence of independent Cauchy
distributed random variables has the same density as each of the
components. There is no convergence of the arithmetic average to
a constant, as we might expect physically. The trouble is that E(R)
does not exist.

2. If R has the Cauchy density and $R_1 = c_1R$, $R_2 = c_2R$, $c_1$, $c_2$
constant and > 0, then

$$M_1(u) = E(e^{-iuR_1}) = E(e^{-iuc_1R}) = M_R(c_1u) = e^{-c_1|u|}$$

and similarly

$$M_2(u) = e^{-c_2|u|}$$

Thus, if $R_0 = R_1 + R_2 = (c_1 + c_2)R$,

$$M_0(u) = e^{-(c_1+c_2)|u|}$$

which happens to be $M_1(u)M_2(u)$. This shows that if the characteristic
function of the sum of two random variables is the product of the
characteristic functions, the random variables need not be independent.

3. If R has the Cauchy density and $R_2 = \theta R$, $\theta > 0$, then by the
calculation performed before Remark 1, $R_2$ has density $f_2(y) =
\theta/\pi(y^2 + \theta^2)$ and (as in Remark 2) characteristic function $M_2(u) =
e^{-\theta|u|}$. A random variable with this density is said to be of the
Cauchy type with parameter $\theta$ or to have the Cauchy density with
parameter $\theta$. The formula for $M_2(u)$ shows immediately that if
$R_1, \ldots, R_n$ are independent and $R_i$ is of the Cauchy type with
parameter $\theta_i$, i = 1, ..., n, then $R_1 + \cdots + R_n$ is of the Cauchy
type with parameter $\theta_1 + \cdots + \theta_n$. ◀
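Remark 1 can be illustrated by simulation. For a Cauchy random variable R, $P\{|R| > 1\} = 1/2$; if the average of n independent Cauchy variables again has the Cauchy density, the fraction of simulated averages exceeding 1 in absolute value should stay near 1/2 for every n rather than shrink toward 0. The sketch below is a rough experiment under stated assumptions: samples are generated as the tangent of a uniform angle (the fact asserted in Problem 6 below), and the seed, trial count, and function names are arbitrary choices.

```python
import math
import random

def cauchy_sample(rng):
    """Draw one Cauchy value as tan(theta), theta uniform on (-pi/2, pi/2)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def fraction_outside_unit(n, trials=4000, seed=0):
    """Fraction of trials in which |average of n Cauchy samples| > 1."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        average = sum(cauchy_sample(rng) for _ in range(n)) / n
        if abs(average) > 1.0:
            count += 1
    return count / trials

if __name__ == "__main__":
    for n in (1, 5, 25):
        print(n, fraction_outside_unit(n))
```

For a distribution with finite mean and variance the same experiment would show the fraction dropping toward 0 as n grows, in line with the weak law of large numbers.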


▶ Example 4. If $R_1, R_2, \ldots, R_n$ are independent and normally distributed,
then $R_0 = R_1 + \cdots + R_n$ is also normally distributed.

We first show that if R is normally distributed with mean m and variance
$\sigma^2$, then

$$N_R(s) = e^{-sm}e^{s^2\sigma^2/2} \qquad (\text{all } s) \qquad (5.2.2)$$

Now

$$N_R(s) = \int_{-\infty}^{\infty} e^{-sx}f_R(x)\,dx = \int_{-\infty}^{\infty} e^{-sx}\,\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-m)^2/2\sigma^2}\,dx$$

Let $y = (x - m)/\sqrt{2}\,\sigma$ and complete the square to obtain

$$N_R(s) = \frac{1}{\sqrt{\pi}}\,e^{-sm}e^{s^2\sigma^2/2}\int_{-\infty}^{\infty} \exp\left[-\left(y + \frac{s\sigma}{\sqrt{2}}\right)^2\right]dy$$

$$= \frac{1}{\sqrt{\pi}}\,e^{-sm}e^{s^2\sigma^2/2}\int_{-\infty}^{\infty} e^{-t^2}\,dt = e^{-sm}e^{s^2\sigma^2/2} \qquad \text{by (2.8.2)}$$

(See the footnote on page 157.) Now if $E(R_i) = m_i$, $\operatorname{Var} R_i = \sigma_i^2$, then

$$N_0(s) = N_1(s)N_2(s)\cdots N_n(s) = e^{-s(m_1+\cdots+m_n)}\,e^{s^2(\sigma_1^2+\cdots+\sigma_n^2)/2}$$

But this is the characteristic function of a normally distributed random
variable, and the result follows. Note that $m_0 = m_1 + \cdots + m_n$, $\sigma_0^2 =
\sigma_1^2 + \cdots + \sigma_n^2$, as we should expect from the results of Section 3.3. ◀
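Formula (5.2.2) can be checked for real s by numerical integration. The sketch below estimates $E(e^{-sR})$ with a midpoint rule and compares it with $e^{-sm}e^{s^2\sigma^2/2}$; it also checks that the product of two such transforms matches the closed form with the means and variances added. The integration window of 15 standard deviations is an assumption that is adequate only for moderate values of $|s|\sigma$, and all function names are illustrative.

```python
import math

def normal_density(x, m, sigma):
    """Normal density with mean m and standard deviation sigma."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def gcf_numeric(s, m, sigma, steps=6000):
    """Midpoint-rule estimate of N_R(s) = E[exp(-sR)] for real s.

    Integrates over m +/- 15 sigma, which assumes |s| * sigma is at most about 2.
    """
    lo = m - 15.0 * sigma
    dx = 30.0 * sigma / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += math.exp(-s * x) * normal_density(x, m, sigma) * dx
    return total

def gcf_closed(s, m, sigma):
    """Closed form (5.2.2): exp(-s m) exp(s^2 sigma^2 / 2)."""
    return math.exp(-s * m) * math.exp(s * s * sigma * sigma / 2)

if __name__ == "__main__":
    s = 0.7
    print(gcf_numeric(s, 1.0, 2.0), gcf_closed(s, 1.0, 2.0))
    product = gcf_numeric(s, 1.0, 2.0) * gcf_numeric(s, -3.0, 0.5)
    print(product, gcf_closed(s, -2.0, math.sqrt(4.25)))
```

In the second comparison the means 1 and −3 add to −2 and the variances 4 and 0.25 add to 4.25, as in the note at the end of the example.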


> Example 5. Let R have the Poisson distribution. 


$$p_R(k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k = 0, 1, \ldots$$

We first show that the generalized characteristic function of R is

$$N_R(s) = \exp[\lambda(e^{-s} - 1)] \qquad (\text{all } s) \qquad (5.2.3)$$

We have

$$N_R(s) = E(e^{-sR}) = \sum_{k=0}^{\infty} p_R(k)e^{-sk} = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!}\,e^{-sk} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^{-s})^k}{k!} = e^{-\lambda}\exp(\lambda e^{-s})$$

as asserted.

We now show that if $R_1, \ldots, R_n$ are independent random variables,
each with the Poisson distribution, then $R_0 = R_1 + \cdots + R_n$ also has the
Poisson distribution.

If $R_i$ has the Poisson distribution with parameter $\lambda_i$, then

$$N_0(s) = N_1(s)N_2(s)\cdots N_n(s) = \exp[(\lambda_1 + \cdots + \lambda_n)(e^{-s} - 1)]$$

This is the characteristic function of a Poisson random variable, and the
result follows. Note that if R has the Poisson distribution with parameter
$\lambda$, then $E(R) = \operatorname{Var} R = \lambda$ (see Problem 8, Section 3.2). Thus the result that
the parameter of $R_0$ is $\lambda_1 + \cdots + \lambda_n$ is consistent with the fact that $E(R_0) =
E(R_1) + \cdots + E(R_n)$ and $\operatorname{Var} R_0 = \operatorname{Var} R_1 + \cdots + \operatorname{Var} R_n$. ◀
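The conclusion of Example 5 can be confirmed directly for two summands by summing products of probabilities. The sketch below computes $P\{R_1 + R_2 = k\}$ as $\sum_j p_1(j)p_2(k - j)$ and compares it with the Poisson probability with parameter $\lambda_1 + \lambda_2$. The parameter values and function names are arbitrary choices for illustration.

```python
import math

def poisson_pmf(lam, k):
    """Poisson probability exp(-lam) lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def pmf_of_sum(lam1, lam2, k):
    """P{R1 + R2 = k} for independent Poisson R1, R2, by direct summation."""
    return sum(poisson_pmf(lam1, j) * poisson_pmf(lam2, k - j) for j in range(k + 1))

if __name__ == "__main__":
    for k in range(6):
        print(k, pmf_of_sum(1.5, 2.5, k), poisson_pmf(4.0, k))
```

The agreement is an instance of the binomial theorem: $\sum_j \binom{k}{j}\lambda_1^j\lambda_2^{k-j} = (\lambda_1 + \lambda_2)^k$.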


> Example 6. In certain situations (especially when the Laplace transforms 
cannot be expressed in closed form) it may be convenient to use a convolu- 
tion procedure rather than the transform technique to find the density of a 
sum of independent random variables. The method is based on the following 
result. 


Convolution Theorem. Let $R_1$ and $R_2$ be independent random variables,
having densities $f_1$ and $f_2$, respectively. Let $R_0 = R_1 + R_2$. Then $R_0$ has a
density given by

$$f_0(z) = \int_{-\infty}^{\infty} f_2(z - x)f_1(x)\,dx = \int_{-\infty}^{\infty} f_1(z - y)f_2(y)\,dy \qquad (5.2.4)$$

(Intuitively, the probability that $R_1$ lies in (x, x + dx] is $f_1(x)\,dx$; given that
$R_1 = x$, the probability that $R_0$ lies in (z, z + dz] is the probability that $R_2$
lies in (z − x, z − x + dz], namely, $f_2(z - x)\,dz$. Integrating with respect to
x, we obtain the result that the probability that $R_0$ lies in (z, z + dz] is

$$dz\int_{-\infty}^{\infty} f_2(z - x)f_1(x)\,dx$$

Since the probability is $f_0(z)\,dz$, (5.2.4) follows.)


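PROOF. To prove the convolution theorem, observe that

$$F_0(z) = P\{R_1 + R_2 \le z\} = \iint_{x+y \le z} f_1(x)f_2(y)\,dx\,dy = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{z-x} f_2(y)\,dy\right]f_1(x)\,dx$$

Let y = u − x to obtain

$$\int_{-\infty}^{\infty}\left[\int_{-\infty}^{z} f_2(u - x)\,du\right]f_1(x)\,dx = \int_{-\infty}^{z}\left[\int_{-\infty}^{\infty} f_2(u - x)f_1(x)\,dx\right]du$$

This proves the first relation of (5.2.4); the other follows by a symmetrical
argument.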

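We consider a numerical example. Let $f_1(x) = 1/x^2$, x ≥ 1; $f_1(x) = 0$,
x < 1. Let $f_2(y) = 1$, 0 ≤ y ≤ 1; $f_2(y) = 0$ elsewhere. If z < 1, $f_0(z) = 0$;
if 1 ≤ z ≤ 2,

$$f_0(z) = \int_{-\infty}^{\infty} f_1(x)f_2(z - x)\,dx = \int_{1}^{z} \frac{1}{x^2}\,dx = 1 - \frac{1}{z}$$

If z > 2,

$$f_0(z) = \int_{z-1}^{z} \frac{1}{x^2}\,dx = \frac{1}{z-1} - \frac{1}{z}$$

(see Figure 5.2.3).

[FIGURE 5.2.3: sketches of $f_1(x)$ and $f_2(-(x - z)) = f_2(z - x)$ for the cases 1 ≤ z ≤ 2 and z > 2.]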


REMARK. The successive application of the convolution theorem shows that
if $R_1, \ldots, R_n$ are independent, each absolutely continuous, then
$R_1 + \cdots + R_n$ is absolutely continuous. ◀
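The numerical example in Example 6 lends itself to a direct check of the convolution integral. With $f_1(x) = 1/x^2$ for x ≥ 1 and $f_2$ uniform on [0, 1], the product $f_1(x)f_2(z - x)$ is nonzero only for max(1, z − 1) ≤ x ≤ z. The sketch below integrates over that range with a midpoint rule and compares with the piecewise closed form $1 - 1/z$ (1 ≤ z ≤ 2) and $1/(z-1) - 1/z$ (z > 2). Function names and step count are assumptions.

```python
def f0_closed(z):
    """Piecewise closed form of the density of R0 = R1 + R2."""
    if z < 1.0:
        return 0.0
    if z <= 2.0:
        return 1.0 - 1.0 / z
    return 1.0 / (z - 1.0) - 1.0 / z

def f0_numeric(z, steps=4000):
    """Midpoint-rule value of the convolution integral at z.

    f2(z - x) = 1 exactly when z - 1 <= x <= z, and f1(x) = 1/x^2 for x >= 1,
    so the integration range is [max(1, z - 1), z].
    """
    lo, hi = max(1.0, z - 1.0), z
    if hi <= lo:
        return 0.0
    dx = (hi - lo) / steps
    return sum(dx / (lo + (k + 0.5) * dx) ** 2 for k in range(steps))

if __name__ == "__main__":
    for z in (0.5, 1.5, 2.0, 2.5, 10.0):
        print(z, f0_numeric(z), f0_closed(z))
```

When no closed form is available, the same loop (with the appropriate integrand and limits) is the "convolution procedure" referred to at the start of the example.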


PROBLEMS 


1. Let $R_1$, $R_2$, and $R_3$ be independent random variables, each uniformly
distributed between −1 and +1. Find and sketch the density function of the random
variable $R_0 = R_1 + R_2 + R_3$.

2. Two independent random variables $R_1$ and $R_2$ each have the density function
f(x) = 1/3, −1 ≤ x ≤ 0; f(x) = 2/3, 0 < x ≤ 1; f(x) = 0 elsewhere. Find
and sketch the density function of $R_1 + R_2$.

3. Let $R = R_1^2 + \cdots + R_n^2$, where $R_1, \ldots, R_n$ are independent, and each $R_i$
is normal with mean 0 and variance 1. Show that the density of R is

$$f(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{(n/2)-1}e^{-x/2}, \qquad x \ge 0$$

(R is said to have the "chi-square" distribution with n "degrees of freedom.")

4. A random variable R is said to have the "gamma distribution" if its density is,
for some $\alpha, \beta > 0$,

$$f(x) = \frac{x^{\alpha-1}e^{-x/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}}, \qquad x \ge 0$$

Show that if $R_1$ and $R_2$ are independent random variables, each having the
gamma distribution with the same $\beta$, then $R_1 + R_2$ also has the gamma
distribution.

5. If $R_1, \ldots, R_n$ are independent nonnegative random variables, each with density
$\lambda e^{-\lambda x}u(x)$, find the density of $R_0 = R_1 + \cdots + R_n$.

6. Let $\theta$ be uniformly distributed between $-\pi/2$ and $\pi/2$. Show that $\tan\theta$ has the
Cauchy density.

7. Let R have density f(x) = 1 − |x|, |x| ≤ 1; f(x) = 0, |x| > 1. Show that
$M_R(u) = 2(1 - \cos u)/u^2$.
*8. (a) Suppose that f is the density of a random variable and the associated
characteristic function M is real-valued, nonnegative, and integrable.
Show that kf(u), −∞ < u < ∞, is the characteristic function of a random
variable with density $kM(x)/2\pi$, where k is chosen so that kf(0) = 1,
that is,

$$\int_{-\infty}^{\infty} [kM(x)/2\pi]\,dx = 1$$

(b) Use part (a) to show that the following are characteristic functions of
random variables: (i) $e^{-|u|}$, (ii) M(u) = 1 − |u|, |u| ≤ 1; M(u) = 0,
|u| > 1.

*9. Use the calculus of residues to evaluate the characteristic function of the Cauchy
density.
10. Calculate the characteristic function of the normal (0, 1) random variable as
follows. Differentiate

$$M(u) = \int_{-\infty}^{\infty} (\cos ux)\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx$$

under the integral sign; then integrate by parts to obtain $M'(u) = -uM(u)$.
Solve the resulting differential equation to obtain $M(u) = e^{-u^2/2}$. From this,
find the characteristic function of a random variable that is normal with mean
m and variance $\sigma^2$.


5.3 PROPERTIES OF CHARACTERISTIC FUNCTIONS 


Let R be a random variable with characteristic function M and generalized
characteristic function N. We shall establish several properties of M and N.

1. M(0) = N(0) = 1.

This follows, since $M(0) = N(0) = E(e^{0})$.

2. |M(u)| ≤ 1 for all u.

If R has a density f, we have

$$|M(u)| = \left|\int_{-\infty}^{\infty} e^{-iux}f(x)\,dx\right| \le \int_{-\infty}^{\infty} |e^{-iux}f(x)|\,dx = \int_{-\infty}^{\infty} f(x)\,dx = 1$$

The general case can be handled by replacing f(x) dx by dF(x), where F
is the distribution function of R. This involves Riemann-Stieltjes integration,
which we shall not enter into here.

3. If R has a density f, and f is even, that is, f(−x) = f(x) for all x,
then M(u) is real-valued for all u. For

$$M(u) = \int_{-\infty}^{\infty} f(x)\cos ux\,dx - i\int_{-\infty}^{\infty} f(x)\sin ux\,dx$$

Since f(x) is an even function of x and sin ux is an odd function of x,
f(x) sin ux is odd; hence the second integral is 0.

It turns out that the assertion that M(u) is real for all u is equivalent to the
statement that R has a symmetric distribution, that is, $P\{R \in B\} = P\{R \in -B\}$
for every Borel set B. ($-B = \{-x : x \in B\}$.)

4. If R is a discrete random variable taking on only integer values, then
$M(u + 2\pi) = M(u)$ for all u.

To see this, write

$$M(u) = E(e^{-iuR}) = \sum_{n=-\infty}^{\infty} p_n e^{-iun} \qquad (5.3.1)$$

where $p_n = P\{R = n\}$. Since $e^{-iun} = e^{-i(u+2\pi)n}$, the result follows.
Note that the $p_n$ are the coefficients of the Fourier series of M on the
interval $[0, 2\pi]$. If we multiply (5.3.1) by $e^{iuk}$ and integrate, we obtain

$$p_k = \frac{1}{2\pi}\int_{0}^{2\pi} M(u)\,e^{iuk}\,du \qquad (5.3.2)$$
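Formula (5.3.2) can be tried out on a small integer-valued distribution. The sketch below builds $M(u) = \sum_n p_n e^{-iun}$ from a dictionary of probabilities and recovers each $p_k$ by approximating the integral with an equally spaced sum over $[0, 2\pi)$. The example distribution, the number of sample points, and the function names are assumptions made for illustration.

```python
import cmath
import math

def char_fn(pmf, u):
    """M(u) = sum over n of p_n exp(-i u n) for pmf given as {n: p_n}."""
    return sum(p * cmath.exp(-1j * u * n) for n, p in pmf.items())

def recover_probability(pmf, k, steps=256):
    """Approximate (1/2pi) * integral_0^{2pi} M(u) exp(i u k) du by a Riemann sum."""
    du = 2.0 * math.pi / steps
    total = sum(char_fn(pmf, j * du) * cmath.exp(1j * j * du * k) for j in range(steps))
    return (total * du / (2.0 * math.pi)).real

if __name__ == "__main__":
    pmf = {-1: 0.2, 0: 0.5, 3: 0.3}
    for k in (-1, 0, 1, 3):
        print(k, recover_probability(pmf, k))
```

For a distribution concentrated on finitely many integers the equally spaced sum reproduces the integral up to rounding, because the integrand is a trigonometric polynomial of low degree.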


We come now to the important moment-generating property. Suppose
that N(s) can be expanded in a power series about s = 0.

$$N(s) = \sum_{k=0}^{\infty} a_k s^k$$

where the series converges in some neighborhood of the origin. This is just
the Taylor expansion of N; hence the coefficients must be given by

$$a_k = \frac{1}{k!}\,\frac{d^kN(s)}{ds^k}\bigg|_{s=0}$$

But if R has density f and we can differentiate $N(s) = \int_{-\infty}^{\infty} e^{-sx}f(x)\,dx$ under
the integral sign, we obtain $N'(s) = \int_{-\infty}^{\infty} -xe^{-sx}f(x)\,dx$; if we can
differentiate k times, we find that

$$N^{(k)}(s) = \int_{-\infty}^{\infty} (-1)^k x^k e^{-sx}f(x)\,dx$$

Thus

$$N^{(k)}(0) = (-1)^k E(R^k) \qquad (5.3.3)$$

and hence

$$a_k = \frac{(-1)^k}{k!}\,E(R^k)$$

The precise statement is as follows.

5. If $N_R(s)$ is analytic at s = 0 (i.e., expandable in a power series in a
neighborhood of s = 0), then all moments of R are finite, and

$$N_R(s) = \sum_{k=0}^{\infty} \frac{(-1)^k}{k!}\,E(R^k)\,s^k \qquad (5.3.4)$$

within the radius of convergence of the series. In particular, (5.3.3) holds
for all k.

We shall not give a proof of (5.3.4). The above remarks make it at least
plausible; further evidence is presented by the following argument. If R
has density f, then

$$N(s) = \int_{-\infty}^{\infty} e^{-sx}f(x)\,dx = \int_{-\infty}^{\infty}\left(1 - sx + \frac{s^2x^2}{2!} - \frac{s^3x^3}{3!} + \cdots + \frac{(-1)^k s^k x^k}{k!} + \cdots\right)f(x)\,dx$$

If we are allowed to integrate term by term, we obtain (5.3.4).

Let us verify (5.3.4) for a numerical example. Let $f(x) = e^{-x}u(x)$, so that
N(s) = 1/(s + 1), Re s > −1. We have a power series expansion for N(s)
about s = 0.

$$\frac{1}{1+s} = 1 - s + s^2 - s^3 + \cdots + (-1)^k s^k + \cdots, \qquad |s| < 1$$

Equation 5.3.4 indicates that we should have $(-1)^k E(R^k)/k! = (-1)^k$, or
$E(R^k) = k!$. To check this, notice that

$$E(R^k) = \int_{0}^{\infty} x^k e^{-x}\,dx = \Gamma(k+1) = k!$$
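The numerical example for (5.3.4) can be reproduced by computing the moments of the exponential density by quadrature, forming the series coefficients $(-1)^k E(R^k)/k!$, and comparing a partial sum with $1/(1+s)$. The truncation point of the integral, the number of steps, and the function names are assumptions of this sketch.

```python
import math

def moment(k, upper=60.0, steps=30000):
    """Midpoint-rule estimate of E(R^k) = integral_0^inf x^k exp(-x) dx."""
    dx = upper / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * dx
        total += x ** k * math.exp(-x) * dx
    return total

def series_coefficient(k):
    """Coefficient (-1)^k E(R^k) / k! of s^k in (5.3.4)."""
    return (-1) ** k * moment(k) / math.factorial(k)

if __name__ == "__main__":
    s = 0.3
    partial = sum(series_coefficient(k) * s ** k for k in range(6))
    print(partial, 1.0 / (1.0 + s))
```

With six terms and s = 0.3 the partial sum differs from 1/(1 + s) by less than $0.3^6$, the size of the first omitted term.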
REMARK. Let R be a discrete random variable taking on only nonnegative
integer values. In the generalized characteristic function

$$N(s) = \sum_{k=0}^{\infty} p_k e^{-sk}, \qquad p_k = P\{R = k\}$$

make the substitution $z = e^{-s}$. We obtain

$$A(z) = N(s)\big|_{z=e^{-s}} = E(z^R) = \sum_{k=0}^{\infty} p_k z^k$$

A is called the generating function of R; it is finite at least for |z| ≤ 1,
since $\sum_{k=0}^{\infty} p_k = 1$.

We consider generating functions in detail in connection with the random
walk problem in Chapter 6.


PROBLEMS 


1. Could $[2/(s + 1)] - (1/s)(1 - e^{-s})$ (Re s > 0) be the generalized characteristic
function of an (absolutely continuous) random variable? Explain.

2. If the density of a random variable R is zero for x ∉ the finite interval [a, b],
show that $N_R(s)$ is finite for all s.

3. We have stated that if $M_R(u)$ is integrable, R has a density [see (5.2.1)]. Is the
converse true?

4. Let R have a lattice distribution; that is, R is discrete and takes on the values
a + nd, where a and d are fixed real numbers and n ranges over the integers.
What can be said about the characteristic function of R?

5. If R has the Poisson distribution with parameter $\lambda$, calculate the mean and
variance of R by differentiating $N_R(s)$.


5.4 THE CENTRAL LIMIT THEOREM


The weak law of large numbers states that if, for each n, $R_1, R_2, \ldots, R_n$
are independent random variables with finite expectations and uniformly
bounded variances, then, for every $\varepsilon > 0$,

$$P\left\{\left|\frac{1}{n}\sum_{i=1}^{n} (R_i - ER_i)\right| \ge \varepsilon\right\} \to 0 \quad \text{as } n \to \infty$$

In particular, if the $R_i$ are independent observations of a random variable R
(with finite mean m and finite variance $\sigma^2$), then

$$P\left\{\left|\frac{1}{n}\sum_{i=1}^{n} R_i - m\right| \ge \varepsilon\right\} \to 0 \quad \text{as } n \to \infty$$

The central limit theorem gives further information; it says roughly that
for large n, the sum $R_1 + \cdots + R_n$ of n independent random variables is
approximately normally distributed, under wide conditions on the individual
$R_i$.

To make the idea of "approximately normal" more precise, we need the
notion of convergence in distribution. Let $R_1, R_2, \ldots$ be random variables
with distribution functions $F_1, F_2, \ldots$, and let R be a random variable with
distribution function F. We say that the sequence $R_1, R_2, \ldots$ converges in
distribution to R (notation: $R_n \xrightarrow{d} R$) iff $F_n(x) \to F(x)$ at all points x at
which F is continuous.

To see the reason for the restriction to continuity points of F, consider
the following example.


▶ Example 1. Let $R_n$ be uniformly distributed between 0 and 1/n (see
Figure 5.4.1). Intuitively, as n → ∞, $R_n$ approximates more and more
closely a random variable R that is identically 0. Now $F_n(x) \to F(x)$ when
x ≠ 0, but not at x = 0, since $F_n(0) = 0$ for all n, and F(0) = 1. Since
x = 0 is not a continuity point of F, we have $R_n \xrightarrow{d} R$. ◀

[FIGURE 5.4.1 Convergence in Distribution: sketches of $F_n(x)$, F(x), and the density $f_n(x)$, which equals n between 0 and 1/n.]
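The behaviour in Example 1 is simple enough to tabulate. The sketch below codes the distribution function $F_n$ of the uniform distribution on (0, 1/n) and the distribution function F of the constant 0, so that one can see $F_n(x) \to F(x)$ for fixed x ≠ 0 while $F_n(0)$ stays at 0. The function names are assumptions of this illustration.

```python
def F_n(n, x):
    """Distribution function of a random variable uniform on (0, 1/n)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0 / n:
        return 1.0
    return n * x

def F(x):
    """Distribution function of the random variable identically 0."""
    return 1.0 if x >= 0.0 else 0.0

if __name__ == "__main__":
    for x in (-0.1, 0.0, 0.001, 0.5):
        print(x, [F_n(n, x) for n in (1, 10, 100, 10000)], F(x))
```

At x = 0.001 the values reach 1 as soon as n ≥ 1000, whereas at x = 0 they never move; this is exactly why the definition excludes the discontinuity points of F.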
REMARK. The type of convergence involved in the weak law of large
numbers is called convergence in probability. The sequence $R_1,
R_2, \ldots$ is said to converge in probability to R (notation: $R_n \xrightarrow{P} R$)
iff for every $\varepsilon > 0$, $P\{|R_n - R| \ge \varepsilon\} \to 0$ as n → ∞. Intuitively, for
large n, $R_n$ is very likely to be very close to R. Thus the weak law of
large numbers states that $(1/n)\sum_{i=1}^{n} (R_i - E(R_i)) \xrightarrow{P} 0$; in the case
in which $E(R_i) = m$ for all i, we have

$$\frac{1}{n}\sum_{i=1}^{n} R_i \xrightarrow{P} m$$

The relation between convergence in probability and convergence in
distribution is outlined in Problem 1.


The basic result about convergence in distribution is the following. 


Theorem 1. The sequence $R_1, R_2, \ldots$ converges in distribution to R if and
only if $M_n(u) \to M(u)$ for all u, where $M_n$ is the characteristic function of
$R_n$, and M is the characteristic function of R.


The proof is measure-theoretic, and will be omitted. 

Thus, in order to show that a sequence converges in distribution to a 
normal random variable, it suffices to show that the corresponding sequence 
of characteristic functions converges to a normal characteristic function. 
This is the technique that will be used to prove the main theorem, which we 
now state. 


Theorem 2. (Central Limit Theorem). For each n, let $R_1, R_2, \ldots, R_n$ be
independent random variables on a given probability space. Assume that the
$R_i$ all have the same density function f (and characteristic function M) with
finite mean m and finite variance $\sigma^2 > 0$, and finite third moment as well. Let

$$T_n = \frac{\sum_{j=1}^{n} R_j - nm}{\sqrt{n}\,\sigma}$$

($= [S_n - E(S_n)]/\sigma(S_n)$, where $S_n = R_1 + \cdots + R_n$ and $\sigma(S_n)$ is the
standard deviation of $S_n$) so that $T_n$ has mean 0 and variance 1. Then $T_1,
T_2, \ldots$ converge in distribution to a random variable that is normal with mean
0 and variance 1.


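*Before giving the proof, we need some preliminaries.

Theorem 3. Let f be a complex-valued function on $E^1$ with n continuous
derivatives on the interval V = (−b, b). Then, on V,

$$f(u) = \sum_{k=0}^{n-1} \frac{f^{(k)}(0)\,u^k}{k!} + u^n\int_{0}^{1} \frac{(1-t)^{n-1}}{(n-1)!}\,f^{(n)}(tu)\,dt$$

Thus, if $|f^{(n)}| \le M$ on V,

$$f(u) = \sum_{k=0}^{n-1} \frac{f^{(k)}(0)\,u^k}{k!} + \frac{\theta\,|u|^n}{n!}, \qquad \text{where } |\theta| \le M \ (\theta \text{ depends on } u)$$

PROOF. Using integration by parts, we obtain

$$\int_{0}^{u} f^{(n)}(t)\,\frac{(u-t)^{n-1}}{(n-1)!}\,dt = \left[f^{(n-1)}(t)\,\frac{(u-t)^{n-1}}{(n-1)!}\right]_{0}^{u} + \int_{0}^{u} f^{(n-1)}(t)\,\frac{(u-t)^{n-2}}{(n-2)!}\,dt$$

$$= -\frac{f^{(n-1)}(0)\,u^{n-1}}{(n-1)!} + \int_{0}^{u} f^{(n-1)}(t)\,\frac{(u-t)^{n-2}}{(n-2)!}\,dt$$

$$= -\frac{f^{(n-1)}(0)\,u^{n-1}}{(n-1)!} - \frac{f^{(n-2)}(0)\,u^{n-2}}{(n-2)!} + \int_{0}^{u} f^{(n-2)}(t)\,\frac{(u-t)^{n-3}}{(n-3)!}\,dt$$

$$= -\sum_{k=1}^{n-1} \frac{f^{(k)}(0)\,u^k}{k!} + \int_{0}^{u} f'(t)\,dt \qquad \text{by iteration}$$

Thus

$$f(u) = \sum_{k=0}^{n-1} \frac{f^{(k)}(0)\,u^k}{k!} + \int_{0}^{u} f^{(n)}(t)\,\frac{(u-t)^{n-1}}{(n-1)!}\,dt$$

The change of variables t = ut' in the above integral yields the desired
expression for f(u).

Now if

$$I = u^n\int_{0}^{1} \frac{(1-t)^{n-1}}{(n-1)!}\,f^{(n)}(tu)\,dt$$

then

$$|I| \le M\,|u|^n\int_{0}^{1} \frac{(1-t)^{n-1}}{(n-1)!}\,dt = \frac{M\,|u|^n}{n!}$$

Let $\theta = I\,n!/|u|^n$; then $|\theta| \le M$ and the result follows.

Theorem 4.

(a)
$$e^{iy} = 1 + iy + \frac{\theta y^2}{2}, \qquad |\theta| \le 1$$

$$e^{iy} = 1 + iy - \frac{y^2}{2} + \frac{\theta_1 |y|^3}{6}, \qquad |\theta_1| \le 1$$

where y is an arbitrary real number, $\theta$, $\theta_1$ depending on y.

PROOF. This is immediate from Theorem 3.

(b) If z is a complex number and |z| ≤ 1/2, then $\ln(1 + z) = z + \theta|z|^2$,
where $|\theta| \le 1$, $\theta$ depending on z. (Take ln 1 = 0 to determine the branch of the
logarithm.)

PROOF.

$$\ln(1 + z) = z - \frac{z^2}{2} + \frac{z^3}{3} - \frac{z^4}{4} + \cdots = z + z^2\left(-\frac{1}{2} + \frac{z}{3} - \frac{z^2}{4} + \cdots\right)$$

Now

$$\left|-\frac{1}{2} + \frac{z}{3} - \frac{z^2}{4} + \cdots\right| \le \frac{1}{2} + \frac{1}{3}\left(\frac{1}{2}\right) + \frac{1}{4}\left(\frac{1}{2}\right)^2 + \cdots \le \sum_{k=1}^{\infty} \left(\frac{1}{2}\right)^k = 1$$

Since $z^2 = |z|^2 e^{i2\arg z}$ and $|e^{i2\arg z}| = 1$, the result follows.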
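The bounds in Theorem 4 can be spot-checked numerically by solving each identity for $\theta$ and confirming that its modulus does not exceed 1. The sketch below does this for a few real y and a few complex z with 0 < |z| ≤ 1/2, using the principal branch of the logarithm (so that ln 1 = 0). The sample points and function names are arbitrary choices; a finite set of checks is of course no substitute for the proof.

```python
import cmath

def theta_second_order(y):
    """theta in exp(iy) = 1 + iy + theta * y^2 / 2, for real y != 0."""
    return (cmath.exp(1j * y) - 1 - 1j * y) * 2 / y ** 2

def theta_third_order(y):
    """theta_1 in exp(iy) = 1 + iy - y^2/2 + theta_1 * |y|^3 / 6, for real y != 0."""
    return (cmath.exp(1j * y) - 1 - 1j * y + y ** 2 / 2) * 6 / abs(y) ** 3

def theta_log(z):
    """theta in ln(1 + z) = z + theta * |z|^2, for complex z with 0 < |z| <= 1/2."""
    return (cmath.log(1 + z) - z) / abs(z) ** 2

if __name__ == "__main__":
    for y in (-7.3, -0.4, 0.1, 2.0, 50.0):
        print(y, abs(theta_second_order(y)), abs(theta_third_order(y)))
    for z in (0.5, -0.5, 0.5j, 0.3 + 0.4j, 0.01):
        print(z, abs(theta_log(z)))
```

For small |y| the moduli come out close to 1, which shows that the constants 1/2 and 1/6 in part (a) cannot be improved.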


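PROOF OF THEOREM 2. By Theorem 1, we must show that $M_{T_n}(u) \to e^{-u^2/2}$,
the characteristic function of a normal (0, 1) random variable. Now we
may assume without loss of generality that m = 0. For if we have proved
the theorem under this assumption, then write

$$T_n = \frac{\sum_{j=1}^{n} (R_j - m)}{\sqrt{n}\,\sigma}$$

The random variables $R_j - m$ have mean 0 and variance $\sigma^2$, and the result
will follow.

The characteristic function of $\sum_{j=1}^{n} R_j$ is $(M(u))^n$; hence the characteristic
function of $T_n$ is

$$M_{T_n}(u) = E\left[\exp\left(-iu\sum_{j=1}^{n} \frac{R_j}{\sqrt{n}\,\sigma}\right)\right] = \left[M\!\left(\frac{u}{\sqrt{n}\,\sigma}\right)\right]^n$$

Now

$$M\!\left(\frac{u}{\sqrt{n}\,\sigma}\right) = \int_{-\infty}^{\infty} e^{-iux/\sqrt{n}\,\sigma}f(x)\,dx = \int_{-\infty}^{\infty}\left(1 - \frac{iux}{\sqrt{n}\,\sigma} - \frac{u^2x^2}{2n\sigma^2} + \frac{\theta_1|u|^3|x|^3}{6n^{3/2}\sigma^3}\right)f(x)\,dx$$

by Theorem 4a. But

$$\int_{-\infty}^{\infty} xf(x)\,dx = m = 0, \qquad \int_{-\infty}^{\infty} x^2f(x)\,dx = \sigma^2$$

Thus

$$M\!\left(\frac{u}{\sqrt{n}\,\sigma}\right) = 1 - \frac{u^2}{2n} + \frac{c}{n^{3/2}}$$

where c depends on u (and remains bounded as n → ∞, since $|\theta_1| \le 1$ and the
third moment is finite). Take logarithms to obtain, by Theorem 4b,

$$\ln\left[M\!\left(\frac{u}{\sqrt{n}\,\sigma}\right)\right]^n = n\ln\left(1 - \frac{u^2}{2n} + \frac{c}{n^{3/2}}\right) = n\left[-\frac{u^2}{2n} + \frac{c}{n^{3/2}} + \theta\left|-\frac{u^2}{2n} + \frac{c}{n^{3/2}}\right|^2\right] \to -\frac{u^2}{2}$$

as n → ∞. Thus

$$M_{T_n}(u) \to e^{-u^2/2}$$

which is the desired result. *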
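The convergence $[M(u/\sqrt{n}\,\sigma)]^n \to e^{-u^2/2}$ at the heart of the proof can be watched numerically in a concrete case. The sketch below takes the $R_j$ uniform on (−1, 1), for which m = 0, $\sigma^2 = 1/3$, and $M(u) = \sin u/u$. This particular distribution, and the names `M` and `M_Tn`, are assumptions chosen for illustration only.

```python
import math

SIGMA = math.sqrt(1.0 / 3.0)   # standard deviation of the uniform density on (-1, 1)

def M(u):
    """Characteristic function of the uniform density on (-1, 1): sin(u)/u."""
    return 1.0 if u == 0.0 else math.sin(u) / u

def M_Tn(u, n):
    """Characteristic function of T_n = (R_1 + ... + R_n) / (sqrt(n) sigma)."""
    return M(u / (math.sqrt(n) * SIGMA)) ** n

if __name__ == "__main__":
    for n in (1, 10, 100, 10000):
        print(n, [M_Tn(u, n) for u in (0.5, 1.0, 2.0)])
    print("limit", [math.exp(-u * u / 2.0) for u in (0.5, 1.0, 2.0)])
```

In this symmetric case the third-order term vanishes and the error decreases roughly like 1/n, faster than the general $n^{-1/2}$ rate suggested by the term $c/n^{3/2}$ in the proof.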


REMARK. Convergence in distribution of $T_n = [S_n - E(S_n)]/\sigma(S_n)$, $S_n =
\sum_{j=1}^{n} R_j$, $\sigma(S_n)$ = standard deviation of $S_n$, can be established under
conditions much more general than those given in Theorem 2. For
example, the finiteness of the third moment is not necessary in
Theorem 2; neither is the assumption that the $R_i$ have a density.
We give two other sufficient conditions for normal convergence.

1. The $R_i$ are uniformly bounded; that is, there is a constant k such
that $|R_i| \le k$ for all i, and also $\operatorname{Var}(S_n) \to \infty$.
The requirement that $\operatorname{Var} S_n \to \infty$ is necessary, for otherwise
we could take $R_1$ to have an arbitrary distribution function, and
$R_n = 0$ for n ≥ 2. Then $S_n = R_1$,

$$T_n = \frac{S_n - ES_n}{\sigma(S_n)} = \frac{R_1 - ER_1}{\sigma(R_1)}$$

Thus the functions $F_{T_n}$ are the same for all n and hence in general
cannot approach

$$F^*(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt$$

which is the normal distribution function with mean 0 and
variance 1.

2. (The Liapounov condition)

$$\frac{\sum_{k=1}^{n} E|R_k - ER_k|^{2+\delta}}{[\sigma(S_n)]^{2+\delta}} \to 0 \qquad \text{for some } \delta > 0$$

REMARK. For each n, let $R_1, R_2, \ldots, R_n$ be independent random variables,
where all $R_i$ have the same distribution function, with finite mean m,
finite variance $\sigma^2$, and also finite third moment. It can be shown that
there is a positive number k such that $|F_{T_n}(x) - F^*(x)| \le k/\sqrt{n}$ for
all x and all n. It follows that, for large n, $S_n$ is approximately normal
with mean nm and variance $n\sigma^2$, in the sense that for all x,

$$\left|F_{S_n}(x) - \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi n}\,\sigma}\,e^{-(t-nm)^2/2n\sigma^2}\,dt\right| \le \frac{k}{\sqrt{n}}$$

For

$$F_{S_n}(x) = P\{S_n \le x\} = P\left\{\frac{S_n - nm}{\sqrt{n}\,\sigma} \le \frac{x - nm}{\sqrt{n}\,\sigma}\right\} = F_{T_n}\!\left(\frac{x - nm}{\sqrt{n}\,\sigma}\right)$$

and this differs from $F^*((x - nm)/\sqrt{n}\,\sigma)$ by $\le k/\sqrt{n}$. But

$$F^*\!\left(\frac{x - nm}{\sqrt{n}\,\sigma}\right) = \int_{-\infty}^{(x-nm)/\sqrt{n}\,\sigma} \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi n}\,\sigma}\,e^{-(y-nm)^2/2n\sigma^2}\,dy$$

and the result follows.

In particular, let R be the number of successes in n Bernoulli
trials; then (Example 1, Section 3.5) $R = R_1 + \cdots + R_n$, where the
$R_i$ are independent, and $P\{R_i = 1\} = p$, $P\{R_i = 0\} = 1 - p$. Thus,
for large n, R is approximately normal with mean np and variance
np(1 − p), in the sense described above.
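The normal approximation to the number of successes in n Bernoulli trials can be examined directly. The sketch below computes, at the integer points k = 0, ..., n, the largest gap between the exact binomial distribution function and the normal distribution function with mean np and variance np(1 − p). The parameter values in the demonstration and the function names are assumptions; the implementation multiplies large binomial coefficients by floats and is intended only for moderate n (a few hundred).

```python
import math

def normal_cdf(x, mean, variance):
    """Normal distribution function with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))

def max_cdf_gap(n, p):
    """Largest |P{R <= k} - normal cdf at k| over k = 0..n, R binomial(n, p)."""
    mean, variance = n * p, n * p * (1.0 - p)
    cumulative, worst = 0.0, 0.0
    for k in range(n + 1):
        cumulative += math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        worst = max(worst, abs(cumulative - normal_cdf(k, mean, variance)))
    return worst

if __name__ == "__main__":
    for n in (25, 100, 400):
        print(n, max_cdf_gap(n, 0.3))
```

Quadrupling n should roughly halve the gap, in keeping with the bound $k/\sqrt{n}$ quoted in the remark.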


PROBLEMS 


1. Show that $R_n \xrightarrow{P} R$ implies $R_n \xrightarrow{d} R$, as follows. Let $F_n$ be the distribution
function of $R_n$, and F the distribution function of R.
(a) If $\varepsilon > 0$, show that

$$P\{R_n \le x\} \le P\{|R_n - R| \ge \varepsilon\} + P\{R \le x + \varepsilon\}$$

and

$$P\{R \le x - \varepsilon\} \le P\{|R_n - R| \ge \varepsilon\} + P\{R_n \le x\}$$

Conclude that

$$F(x - \varepsilon) - P\{|R_n - R| \ge \varepsilon\} \le F_n(x) \le P\{|R_n - R| \ge \varepsilon\} + F(x + \varepsilon)$$

(b) If $R_n \xrightarrow{P} R$ and F is continuous at x, show that $F_n(x) \to F(x)$.


2. Give an example of a sequence $R_1, R_2, \ldots$ that converges in distribution to a
random variable R, but does not converge in probability to R.


3. If $R_n$ converges in distribution to a constant c, that is, $\lim_{n\to\infty} F_n(x) = 1$ for
x > c, and = 0 for x < c, show that $R_n \xrightarrow{P} c$.

4. Let $R_1, R_2, \ldots$ be random variables such that $P\{R_n = e^n\} = 1/n$, $P\{R_n = 0\} =
1 - 1/n$. Show that $R_n \xrightarrow{P} 0$, but $E(R_n^k) \to \infty$ as n → ∞, for any fixed k > 0.

5. Two candidates, A and B, are running for President and it is desired to predict 
the outcome of the election. Assume that n people are selected independently 
and at random and asked their preference. Suppose that the probability of 
selecting a voter who favors A in any particular observation is p. (p is fixed but 
unknown.) Let $Q_n$ be the relative frequency of "A" voters in the sample; that is,

$$Q_n = \frac{\text{number of "A" voters in sample}}{\text{size of sample}}$$

(a) We wish to choose n large enough so that $P\{|Q_n - p| < .001\} > .99$ for
all possible values of p. In other words, we wish to predict A's percentage
of the vote to within .1%, with 99% confidence. Estimate the minimum
value of n.

(b) Estimate the minimum value of n if we wish to predict A's percentage to
within 1%, with 95% confidence. (Use the central limit theorem.)

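6. (a) Show that the normal density function (with mean 0, variance 1) satisfies
the inequality

$$\int_{x}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt \le \frac{1}{\sqrt{2\pi}\,x}\,e^{-x^2/2}, \qquad x > 0$$

HINT: show that

$$\frac{1}{x}\,e^{-x^2/2} = \int_{x}^{\infty} e^{-t^2/2}\left[1 + \frac{1}{t^2}\right]dt$$

by differentiating both sides.
(b) Show that

$$\int_{x}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt \sim \frac{1}{\sqrt{2\pi}\,x}\,e^{-x^2/2}$$

in the sense that the ratio of the two sides approaches 1 as x → ∞.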


7. Consider a container holding n = 10° molecules. In the steady state it is reason- 
able that there be roughly as many molecules on the left side as on the right. 
Assume that the molecules are dropped independently and at random into the 
container and that each molecule may fall with equal probability on the left
or right side. 

If R is the number of molecules on the right side of the container, we may 
invoke the central limit theorem to justify the physical assumption that for the 
purpose of calculating P{a < R < b} we may regard R as normally distributed 
with mean np = n/2 and variance np(1 — p) = n/4. 

Use Problem 6 to bound P{|R — n/2| > .005n}, the probability of a fluctuation 
about the mean of more than +.5% of the total number of molecules. 


8. Let R be the number of successes in 10,000 Bernoulli trials, with probability
of success .8 on a given trial. Use the central limit theorem to estimate 
P{7940 < R < 8080}. 


Infinite Sequences of 
Random Variables 


6.1 INTRODUCTION 


We have not yet encountered any situation in which it is necessary to consider
an infinite collection of random variables, all defined on the same probability
space. In the central limit theorem, for example, the basic underlying
hypothesis is "For each n, let $R_1, \ldots, R_n$ be independent random variables."
As n changes, the underlying probability space may change, but this is of
no consequence, since a convergence in distribution statement is a statement
about convergence of a sequence of real-valued functions on $E^1$. If $R_1, \ldots, R_n$
are independent, with distribution functions $F_1, \ldots, F_n$, and $T_n =
(S_n - E(S_n))/\sigma(S_n)$, $S_n = R_1 + \cdots + R_n$, the distribution function of $T_n$
is completely determined by the $F_i$, and the validity of a statement about
convergence in distribution of $T_n$ is also determined by the $F_i$, regardless of
the construction of the underlying space.

However, there are occasions when it is necessary to consider an infinite
number of random variables defined on the same probability space. For
example, consider the following random experiment. We start at the origin
on the real line, and flip a coin independently over and over again. If the
result of the first toss is heads, we take one step to the right (i.e., from x = 0
to x = 1), and if the result is tails, we move one step to the left (to x = −1).
We continue the process; if we are at x = k after n trials, then at trial n + 1
we move to x = k + 1 if the (n + 1)th toss results in heads, or to x = k − 1
if it results in tails. We ask, for example, for the probability of eventually
returning to the origin.

Now the position $S_n$ after n steps is the sum $R_1 + \cdots + R_n$ of n independent
random variables, where $P\{R_i = 1\} = p$ = probability of heads,
$P\{R_i = -1\} = 1 - p$. We are looking for $P\{S_n = 0 \text{ for some } n > 0\}$.

We must decide what probability space we are considering. If we are
interested only in the first n trials, there is no problem. We simply have a
sequence of n Bernoulli trials, and we have considered the assignment of
probabilities in detail. However, the difficulty is that the event $\{S_n = 0$ for
some n > 0$\}$ involves infinitely many trials. We must take $\Omega = E^{\infty}$ = all
infinite sequences $(x_1, x_2, \ldots)$ of real numbers. (In this case we may restrict
the $x_i$ to be ±1, but it is convenient to allow arbitrary $x_i$ so that the discussion
will apply to the general problem of assigning probabilities to events
involving infinitely many random variables.)

We have the problem of specifying the sigma field $\mathcal{F}$ and the probability
measure P. The physical description of the problem has determined all
probabilities involving finitely many $R_i$; that is, we know $P\{(R_1, \ldots, R_n) \in B\}$
for each positive integer n and n-dimensional Borel set B. What we would
like to conclude is that a reasonable specification of probabilities involving
finitely many $R_i$ determines the probability of events involving all the $R_i$.
For example, consider {all $R_i$ = 1}. This event may be expressed as

$$\bigcap_{n=1}^{\infty} \{R_1 = 1, \ldots, R_n = 1\}$$

The sets $\{R_1 = 1, \ldots, R_n = 1\}$ form a contracting sequence; hence

$$P\{\text{all } R_i = 1\} = \lim_{n\to\infty} P\{R_1 = 1, \ldots, R_n = 1\} = \lim_{n\to\infty} p^n = 0 \qquad \text{if } p < 1$$

As another example,

$$\{R_n = 1 \text{ for infinitely many } n\} = \{\text{for every } n, \text{ there exists } k \ge n \text{ such that } R_k = 1\} = \bigcap_{n=1}^{\infty}\bigcup_{k=n}^{\infty} \{R_k = 1\}$$

Thus

$$P\{R_n = 1 \text{ for infinitely many } n\} = \lim_{n\to\infty} P\left[\bigcup_{k=n}^{\infty} \{R_k = 1\}\right] = \lim_{n\to\infty}\lim_{m\to\infty} P\left[\bigcup_{k=n}^{m} \{R_k = 1\}\right]$$

Thus again the probability is determined once we know the probabilities
of all events involving finitely many $R_i$.

We now sketch the general situation. Let $\Omega = E^{\infty}$. A set of the form
$\{(x_1, x_2, \ldots) : (x_1, \ldots, x_n) \in B_n\}$, where $B_n \subset E^n$, is called a cylinder with
base $B_n$, a measurable cylinder if $B_n$ is a Borel subset of $E^n$.

Suppose that for each n we specify a probability measure $P_n$ on the Borel
subsets of $E^n$; $P_n(B_n)$ is to be interpreted as $P\{(R_1, \ldots, R_n) \in B_n\}$, where
$R_i(x_1, x_2, \ldots) = x_i$.

Suppose, for example, that we have specified $P_5$. Then $P_k$, k < 5, is
determined. In particular, in the discrete case we have

$$P\{(R_1, R_2, R_3) \in B_3\} = \sum_{\substack{(x_1,x_2,x_3)\in B_3 \\ -\infty<x_4<\infty \\ -\infty<x_5<\infty}} P\{R_1 = x_1, R_2 = x_2, R_3 = x_3, R_4 = x_4, R_5 = x_5\}$$

and in the absolutely continuous case we have

$$P\{(R_1, R_2, R_3) \in B_3\} = \int\cdots\int_{\substack{(x_1,x_2,x_3)\in B_3 \\ -\infty<x_4<\infty \\ -\infty<x_5<\infty}} f_{12345}(x_1, x_2, x_3, x_4, x_5)\,dx_1\,dx_2\,dx_3\,dx_4\,dx_5$$

In general, once $P_n$ is given, $P_k$, k < n, is determined. But we have specified
$P_k$, k < n, at the beginning; if our assignment of probabilities is to make
sense, the original $P_k$ must agree with that derived from $P_n$, n > k.

If, for all n = 1, 2, ... and all k < n, the probability measure $P_k$ originally
specified agrees with that derived from $P_n$, n > k, we say that the probability
measures are consistent. Under the consistency hypothesis, the
Kolmogorov extension theorem states that there is a unique probability
measure P on $\mathcal{F}$ = the smallest sigma field of subsets of $\Omega$ containing the
measurable cylinders, such that

$$P(\text{the measurable cylinder with base } B_n) = P_n(B_n)$$

for all n = 1, 2, ... and all Borel subsets $B_n$ of $E^n$.

In other words, a consistent specification of finite dimensional probabilities
determines the probabilities of events involving all the $R_i$.

We now consider the case in which the R, are discrete. Here we determine 
probabilities involving (R,,..., R,) by prescribing the joint probability 
function 

Pre.--n(%1, aes Tn) = P{R, = %,---,; R,, Ln} 
We may then derive the joint probability function of R,,..., R,: 


P{R, = %,...,R,=%,}$= > P{R,=%,...,R, =2,} (6.1.1) 


Le+1s eres Ly, 
If this coincides with the given $p_{12 \cdots k}$ (for all $n$ and all $k < n$) we say that the
system of joint probability functions is consistent. If we sum (6.1.1) over
$(x_1, \ldots, x_k) \in B_k$, we find that consistency of the joint probability functions
is equivalent to consistency of the associated probability measures $P_n$.
Thus in the discrete case the essential point is the consistency of the joint
probability functions. In particular, suppose that we require that for each $n$,
$R_1, \ldots, R_n$ be independent, with $R_i$ having a specified probability function
$p_i$. Then (6.1.1) becomes

$$P\{R_1 = x_1, \ldots, R_k = x_k\} = \sum_{x_{k+1}, \ldots, x_n} p_1(x_1) \cdots p_n(x_n) = p_1(x_1) \cdots p_k(x_k)$$

and thus the joint probability functions are consistent. The point we are
making here is that there is a unique probability measure on $\mathscr{F}$ such that the
random variables $R_1, R_2, \ldots$ are independent, each with a specified
probability function. In other words, the statement "Let $R_1, R_2, \ldots$ be
independent random variables, where $R_i$ is discrete and has probability function $p_i$"
is unambiguous.
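The consistency check for independent discrete variables can be tried numerically. The following is a minimal sketch, not part of the original argument: the marginal tables and the helper names `joint_pmf` and `marginalize` are illustrative assumptions. It sums the three-dimensional product function over $x_3$ as in (6.1.1) and compares the result with the two-dimensional product.

```python
from itertools import product
from math import isclose

def joint_pmf(marginals, xs):
    """Joint probability of independent discrete variables: p_1(x_1) * ... * p_n(x_n)."""
    prob = 1.0
    for p_i, x_i in zip(marginals, xs):
        prob *= p_i.get(x_i, 0.0)
    return prob

def marginalize(marginals, k, xs_k):
    """Right side of (6.1.1): sum the n-dimensional joint pmf over x_{k+1}, ..., x_n."""
    tails = [list(p_i.keys()) for p_i in marginals[k:]]
    return sum(joint_pmf(marginals, tuple(xs_k) + tail) for tail in product(*tails))

# Hypothetical marginals p_1, p_2, p_3 (each maps a value to its probability).
marginals = [{1: 0.3, -1: 0.7}, {0: 0.5, 1: 0.25, 2: 0.25}, {1: 0.6, -1: 0.4}]
lhs = joint_pmf(marginals[:2], (1, 2))   # p_1(1) * p_2(2)
rhs = marginalize(marginals, 2, (1, 2))  # sum over x_3 of p_1(1) p_2(2) p_3(x_3)
print(lhs, rhs)
```

The two printed values agree up to rounding because each $p_i$ sums to 1, which is exactly the step used in the text.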
In the absolutely continuous case, probabilities involving $R_1, \ldots, R_n$
are determined by the joint density function $f_{12 \cdots n}$. The joint density of
$R_1, \ldots, R_k$ is then given by

$$f_{12 \cdots k}(x_1, \ldots, x_k) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{12 \cdots n}(x_1, \ldots, x_n)\, dx_{k+1} \cdots dx_n \tag{6.1.2}$$

If this coincides with the given $f_{12 \cdots k}$ ($n, k = 1, 2, \ldots$, $k < n$), we say that
the system of joint densities is consistent. By integrating (6.1.2) over Borel
sets $B_k \subset E^k$, we find that consistency of joint density functions is equivalent
to consistency of the associated probability measures $P_n$. In particular, if we
require that for each $n$, $R_1, \ldots, R_n$ be independent, with $R_i$ having a
specified density function $f_i$, then (6.1.2) becomes

$$f_{12 \cdots k}(x_1, \ldots, x_k) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_1(x_1) \cdots f_n(x_n)\, dx_{k+1} \cdots dx_n = f_1(x_1) \cdots f_k(x_k)$$

Therefore the joint density functions are consistent, and the statement
"Let $R_1, R_2, \ldots$ be independent random variables, where $R_i$ is absolutely
continuous with density function $f_i$" is unambiguous.


PROBLEMS 


1. By working directly with the probability measures $P_n$, give an argument shorter
than the one above to show that the statement "Let $R_1, R_2, \ldots$ be independent
random variables, where $R_i$ is absolutely continuous with density $f_i$," is
unambiguous.
2. If $R_1, R_2, \ldots$ are independent, with $P\{R_i = 1\} = p$, $P\{R_i = -1\} = 1 - p$, as
at the beginning of the section, find $P\{R_n = 1 \text{ for infinitely many } n\}$; also, find
$P\{\lim_{n \to \infty} R_n = 1\}$. (Assume $0 < p < 1$.)


6.2 THE GAMBLER'S RUIN PROBLEM


Suppose that a gambler starts with a capital of $x$ dollars and plays a sequence
of games against an opponent with $b - x$ dollars. At each trial he wins a
dollar with probability $p$, and loses a dollar with probability $q = 1 - p$.
(The trials are assumed independent, with $0 < p < 1$, $0 < x < b$.) The
process continues until the gambler's capital reaches 0 (ruin) or $b$ (victory).
We wish to find $h(x)$, the probability of eventual ruin when the initial capital
is $x$.

Let $A$ = {eventual ruin}, $B_1$ = {win on trial 1}, $B_2$ = {lose on trial 1}.
By the theorem of total probability,

$$P(A) = P(B_1)P(A \mid B_1) + P(B_2)P(A \mid B_2)$$

We are given that $P(B_1) = p$, $P(B_2) = q$; $P(A)$ is the unknown probability
$h(x)$. Now if the gambler wins at the first trial, his capital is then $x + 1$;
thus $P(A \mid B_1)$ is the probability of eventual ruin, starting at $x + 1$, that is,
$h(x + 1)$. Similarly, $P(A \mid B_2) = h(x - 1)$. Thus

$$h(x) = p\,h(x + 1) + q\,h(x - 1), \quad x = 1, 2, \ldots, b - 1 \tag{6.2.1}$$


[The intuition behind the argument leading to (6.2.1) is compelling; however,
a formal proof involves concepts not treated in this book, and will be
omitted.]

We have not yet found $h(x)$, but we know that it satisfies (6.2.1), a linear
homogeneous difference equation with constant coefficients. The boundary
conditions are $h(0) = 1$, $h(b) = 0$. To see this, note that if $x = 1$, then with
probability $p$ the gambler wins on trial 1; his probability of eventual ruin is
then $h(2)$. With probability $q$ he loses on trial 1, and then he is already ruined.
In other words, if (6.2.1) is to be satisfied at $x = 1$, we must have $h(0) = 1$.
Similarly, examination of (6.2.1) at $x = b - 1$ shows that $h(b) = 0$.

The difference equation may be put into the standard form

$$p\,h(x + 2) - h(x + 1) + q\,h(x) = 0, \quad x = 0, 1, \ldots, b - 2, \quad h(0) = 1, \; h(b) = 0$$

It is solved in the same way as the analogous differential equation

$$p \frac{d^2 y}{dx^2} - \frac{dy}{dx} + qy = 0$$

We assume an exponential solution; for convenience, we take $h(x) = \lambda^x$
($= e^{x \ln \lambda}$). Then $p\lambda^{x+2} - \lambda^{x+1} + q\lambda^x = \lambda^x(p\lambda^2 - \lambda + q) = 0$. Since $\lambda^x$ is
never 0, the only allowable $\lambda$'s are the roots of the characteristic equation
$p\lambda^2 - \lambda + q = 0$, namely,

$$\lambda = \frac{1}{2p}\left(1 \pm \sqrt{1 - 4pq}\right)$$

Now

$$(p - q)^2 = p^2 - 2pq + q^2 = p^2 + 2pq + q^2 - 4pq = (p + q)^2 - 4pq = 1 - 4pq \tag{6.2.2}$$

Hence

$$\lambda = \frac{1}{2p}\left(1 \pm |p - q|\right)$$

The two roots are

$$\lambda_1 = \frac{1 + p - q}{2p} = \frac{2p}{2p} = 1, \qquad \lambda_2 = \frac{1 - p + q}{2p} = \frac{2q}{2p} = \frac{q}{p}$$
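CASE 1. $p \ne q$. Then $\lambda_1$ and $\lambda_2$ are distinct; hence

$$h(x) = A\lambda_1^x + C\lambda_2^x = A + C\left(\frac{q}{p}\right)^x$$

$$h(0) = A + C = 1$$

$$h(b) = A + C\left(\frac{q}{p}\right)^b = 0$$

Solving, we obtain

$$A = \frac{-(q/p)^b}{1 - (q/p)^b}, \qquad C = \frac{1}{1 - (q/p)^b}$$

Therefore

$$h(x) = \frac{(q/p)^x - (q/p)^b}{1 - (q/p)^b} \tag{6.2.3}$$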


CASE 2. $p = q = 1/2$. Then $\lambda_1 = \lambda_2 = \lambda = 1$, a repeated root. In such a
case (just as in the analogous differential equation) we may construct
two linearly independent solutions by taking $\lambda^x$ and $x\lambda^x$; that is,

$$h(x) = A\lambda^x + Cx\lambda^x = A + Cx$$

$$h(0) = A = 1$$

$$h(b) = A + Cb = 0 \quad \text{so } C = -\frac{1}{b}$$

Thus

$$h(x) = 1 - \frac{x}{b} = \frac{b - x}{b} \tag{6.2.4}$$

so that the probability of eventual ruin is the ratio of the adversary's
capital to the total capital. A sketch of $h(x)$ in the various cases is
shown in Figure 6.2.1.

FIGURE 6.2.1 Probability of Eventual Ruin.
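As a quick numerical check of (6.2.3) and (6.2.4), the sketch below evaluates the closed forms and confirms that they satisfy the difference equation (6.2.1) and the boundary conditions. The function name `ruin_probability` and the sample values of $b$ and $p$ are assumptions made for illustration.

```python
from math import isclose

def ruin_probability(x, b, p):
    """h(x) from (6.2.3) when p != q, and from (6.2.4) when p = q = 1/2."""
    q = 1.0 - p
    if isclose(p, q):
        return (b - x) / b
    r = q / p
    return (r**x - r**b) / (1.0 - r**b)

b, p = 10, 0.45
h = [ruin_probability(x, b, p) for x in range(b + 1)]
# Residual of h(x) = p h(x+1) + q h(x-1) at each interior point x = 1, ..., b-1.
residuals = [h[x] - (p * h[x + 1] + (1 - p) * h[x - 1]) for x in range(1, b)]
print(h[0], h[b], max(abs(r) for r in residuals))
```

The residuals should vanish up to floating-point rounding, and the first and last entries reproduce $h(0) = 1$ and $h(b) = 0$.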


Similarly, let $g(x)$ be the probability of eventual victory, starting with a
capital of $x$ dollars. We cannot conclude immediately that $g(x) = 1 - h(x)$,
since there is the possibility that the game will never end; that is, the
gambler's fortune might oscillate forever within the limits $x = 1$ and
$x = b - 1$. However, we can show that this event has probability 0, as follows.
By the same reasoning as that leading to (6.2.1), we obtain

$$g(x) = p\,g(x + 1) + q\,g(x - 1) \tag{6.2.5}$$

The boundary conditions are now $g(0) = 0$, $g(b) = 1$. But we may verify
that $g(x) = 1 - h(x)$ satisfies (6.2.5) with the given boundary conditions;
since the solution is unique (see Problem 1), we must have $g(x) = 1 - h(x)$;
that is, the game ends with probability 1.

We should mention, at least in passing, the probability space we are working
in. We take $\Omega = E^\infty$, $\mathscr{F}$ = the smallest sigma field containing the measurable
cylinders, $R_i(x_1, x_2, \ldots) = x_i$, $i = 1, 2, \ldots$, $P$ the probability measure
determined by the requirement that $R_1, R_2, \ldots$ be independent, with
$P\{R_i = 1\} = p$, $P\{R_i = -1\} = q$. Thus $R_i$ is the gambler's net gain on trial
$i$, and $x + \sum_{i=1}^{n} R_i$ is his capital after $n$ trials. We are looking for

$$h(x) = P\Big\{\text{for some } n,\; x + \sum_{i=1}^{n} R_i = 0,\; 0 < x + \sum_{i=1}^{k} R_i < b,\; k = 1, 2, \ldots, n - 1\Big\}$$

A sequence of random variables of the form $x + \sum_{i=1}^{n} R_i$, $n = 1, 2, \ldots$,
where the $R_i$ are independent and have the same distribution function (or,
more generally, $R_0 + \sum_{i=1}^{n} R_i$, $n = 1, 2, \ldots$, where $R_0, R_1, R_2, \ldots$ are
independent and $R_1, R_2, \ldots$ have the same distribution function), is called a
random walk, a simple random walk if $R_i$ ($i \ge 1$) takes on only the values $\pm 1$.
The present case may be regarded as a simple random walk with absorbing
barriers at 0 and $b$, since when the gambler's fortune reaches either of these
figures, the game ends, and we may as well regard his capital as forever
frozen.
We wish to investigate the effect of removing one or both of the barriers. 


Let $h_b(x)$ be the probability of eventual ruin starting from $x$, when the total
capital is $b$. It is reasonable to expect that $\lim_{b \to \infty} h_b(x)$ should be the
probability of eventual ruin when the gambler has the misfortune of playing
against an adversary with infinite capital. Let us verify this.

Consider the simple random walk with only the barrier at $x = 0$ present;
that is, the adversary has infinite capital. If the gambler starts at $x > 0$,
his probability $h^*(x)$ of eventual ruin is

$$P(A) = P\{\text{for some positive integer } b,\; 0 \text{ is reached before } b\}$$

Let $A_b$ = {0 is reached before $b$}. The sets $A_b$, $b = 1, 2, \ldots$ form an expanding
sequence whose union is $A$; hence

$$P(A) = \lim_{b \to \infty} P(A_b)$$

But

$$P(A_b) = h_b(x)$$

Consequently

$$h^*(x) = \lim_{b \to \infty} h_b(x) = \begin{cases} 1 & \text{if } q \ge p \\ \left(\dfrac{q}{p}\right)^x & \text{if } q < p \end{cases} \quad \text{by (6.2.3) and (6.2.4)} \quad (x = 1, 2, \ldots) \tag{6.2.6}$$


Thus, in fact, $\lim_{b \to \infty} h_b(x)$ is the probability $h^*(x)$ of eventual ruin when the
adversary has infinite capital; $1 - h^*(x)$ is the probability that the origin will
never be reached, that is, that the game will never end. If $q < p$, then $h^*(x) < 1$,
and so there is a positive probability that the game will go on forever.
Finally, consider a simple random walk starting at 0, with no barriers.
Let $r$ be the probability of eventually returning to 0. Now if $R_1 = 1$ (a win
on trial 1), there will be a return to 0 with probability $h^*(1)$. If $R_1 = -1$
(a loss on trial 1), the probability of eventually reaching 0 is found by evaluating
$h^*(1)$ with $q$ and $p$ interchanged, that is, 1 for $q \le p$, and $p/q$ if $p < q$.
Thus, if $q < p$,

$$r = p\left(\frac{q}{p}\right) + q(1) = 2q$$

If $p \le q$,

$$r = p(1) + q\left(\frac{p}{q}\right) = 2p$$

One expression covers both of these cases, namely,

$$r = 1 - |p - q| \quad (< 1 \text{ if } p \ne q) \tag{6.2.7}$$
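The two-case computation of $r$ can be reproduced directly from (6.2.6). This is a small sketch under stated assumptions: the function names `ruin_vs_infinite_capital` and `return_probability` are hypothetical, and the sample values of $p$ are arbitrary.

```python
def ruin_vs_infinite_capital(x, p):
    """h*(x) from (6.2.6): 1 if q >= p, otherwise (q/p)**x."""
    q = 1.0 - p
    return 1.0 if q >= p else (q / p) ** x

def return_probability(p):
    """r assembled from the two first-step cases, as in the derivation of (6.2.7)."""
    q = 1.0 - p
    win_then_back = ruin_vs_infinite_capital(1, p)   # from +1 down to 0
    loss_then_back = ruin_vs_infinite_capital(1, q)  # from -1 up to 0: p and q interchanged
    return p * win_then_back + q * loss_then_back

for p in (0.3, 0.5, 0.6):
    print(p, return_probability(p), 1 - abs(p - (1 - p)))
```

For each $p$ the last two printed columns should coincide, matching $r = 1 - |p - q|$.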


PROBLEMS 


1. Show that the difference equation arising from the gambler’s ruin problem has a 
unique solution subject to given boundary conditions at x = 0 and x = b. 

2. In the gambler's ruin problem, let $D(x)$ be the average duration of the game
when the initial capital is $x$. Show that $D(x) = p(1 + D(x + 1)) + q(1 + D(x - 1))$,
$x = 1, 2, \ldots, b - 1$ [the boundary conditions are $D(0) = D(b) = 0$].

3. Show that the solution to the difference equation of Problem 2 is

$$D(x) = \frac{x}{q - p} - \frac{b}{q - p} \cdot \frac{1 - (q/p)^x}{1 - (q/p)^b} \quad \text{if } p \ne q$$

$$D(x) = x(b - x) \quad \text{if } p = q = 1/2$$

[$D(x)$ can be shown to be finite, so that the usual method of solution applies;
see Problem 4, Section 7.4.]

REMARK. If $D_b(x)$ is the average duration of the game when the total capital is $b$,
then $\lim_{b \to \infty} D_b(x)$ ($= \infty$ if $p \ge q$, $= x/(q - p)$ if $p < q$) can be interpreted
as the average length of time required to reach 0 when the adversary has
infinite capital.

4. In a simple random walk starting at 0 (with no barriers), show that the average
length of time required to return to the origin is infinite. (Corollary: A couple
decides to have children until the number of boys equals the number of girls.
The average number of children is infinite.)

5. Consider the simple random walk starting at 0. If $b > 0$, find the probability
that $x = b$ will eventually be reached.


6.3 COMBINATORIAL APPROACH TO THE RANDOM 
WALK; THE REFLECTION PRINCIPLE 


In this section we obtain, by combinatorial methods, some explicit results
connected with the simple random walk. We assume that the walk starts at
0, with no barriers; thus the position at time $n$ is $S_n = \sum_{k=1}^{n} R_k$, where
$R_1, R_2, \ldots$ are independent random variables with $P\{R_k = 1\} = p$,
$P\{R_k = -1\} = q = 1 - p$. We may regard the $R_k$ as the results of an infinite
sequence of Bernoulli trials; we call an occurrence of $R_k = 1$ a "success,"
and that of $R_k = -1$ a "failure."

Suppose that among the first $n$ trials there are exactly $a$ successes and $b$
failures ($a + b = n$); say $a > b$. We ask for the (conditional) probability
that the process will always be positive at times $1, 2, \ldots, n$, that is,

$$P\{S_1 > 0, S_2 > 0, \ldots, S_n > 0 \mid S_n = a - b\} \tag{6.3.1}$$

(Notice that the only way that $S_n$ can equal $a - b$ is for there to be $a$
successes and $b$ failures in the first $n$ trials; for if $x$ is the number of successes and
$y$ the number of failures, then $x + y = n = a + b$, $x - y = a - b$; hence
$x = a$, $y = b$. Thus $\{S_n = a - b\}$ = {$a$ successes, $b$ failures in the first $n$
trials}.)

Now (6.3.1) may be written as

$$\frac{P\{S_1 > 0, \ldots, S_n > 0, S_n = a - b\}}{P\{S_n = a - b\}} \tag{6.3.2}$$

A favorable outcome in the numerator corresponds to a path from (0, 0)
to $(n, a - b)$ that always lies above the axis,† and a favorable outcome
in the denominator to an arbitrary path from (0, 0) to $(n, a - b)$ (see
Figure 6.3.1). Thus (6.3.2) becomes

$$\frac{p^a q^b\,[\text{the number of paths from } (0, 0) \text{ to } (n, a - b) \text{ that are always above 0}]}{p^a q^b\,[\text{the total number of paths from } (0, 0) \text{ to } (n, a - b)]}$$

A path from (0, 0) to $(n, a - b)$ is determined by selecting $a$ positions out
of $n$ for the successes to occur; the total number of paths is $\binom{n}{a} = \binom{a+b}{a}$. To
count the number of paths lying above 0, we reason as follows (see Figure
6.3.2).

Let A and B be points above the axis. Given any path from A to B that 
touches or crosses the axis, reflect the segment between $A$ and the first zero
point $T$, as shown. We get a path from $A'$ to $B$, where $A'$ is the reflection of
A. Conversely, given any path from A’ to B, the path must reach the axis at 
some point T. Reflecting the segment from A’ to T, we obtain a path from A 
to B that touches or crosses the axis. The correspondence thus established is 
one-to-one; hence 


the number of paths from A to B that touch or cross the axis 
= the total number of paths from A’ to B 


† Terminology: For the purpose of determining whether or not a path lies above the axis
(or touches it, crosses it, etc.), the end points are not included in the path.

FIGURE 6.3.1 A Path in the Random Walk ($S_n$ versus $n$).
$n = 6$, $a = 4$, $b = 2$;
$P\{R_1 = R_2 = 1, R_3 = -1, R_4 = R_5 = 1, R_6 = -1\} = p^4 q^2$;
this is one contribution to
$P\{S_1 > 0, \ldots, S_6 > 0, S_6 = 2\}$


This is called the reflection principle. Now
the number of paths from (0, 0) to $(n, a - b)$ lying entirely above the axis

= the number from (1, 1) to $(n, a - b)$ that neither touch nor cross
the axis (since $R_1$ must be $+1$ in this case)

= the total number from (1, 1) to $(n, a - b)$ minus the number from
(1, 1) to $(n, a - b)$ that either touch or cross the axis

= the total number from (1, 1) to $(n, a - b)$ minus the total number
from $(1, -1)$ to $(n, a - b)$ (by the reflection principle)

$$= \binom{n-1}{a-1} - \binom{n-1}{a}$$

[Notice that in a path from (1, 1) to $(n, a - b)$ there are $x$ successes and $y$
failures, where $x + y = n - 1 = a + b - 1$, $x - y = a - b - 1$, so
$x = a - 1$, $y = b$. Similarly, a path from $(1, -1)$ to $(n, a - b)$ must have $a$
successes and $b - 1$ failures.]

$$= \frac{(n-1)!}{(a-1)!\,b!} - \frac{(n-1)!}{a!\,(b-1)!} = \frac{(n-1)!\,(a - b)}{a!\,b!} = \frac{a - b}{n} \cdot \frac{n!}{a!\,b!}$$

FIGURE 6.3.2 Reflection Principle.

Thus
if $a + b = n$, the number of paths from (0, 0) to $(n, a - b)$
lying entirely above the axis

$$= \frac{a - b}{n} \binom{n}{a} \tag{6.3.3}$$

Therefore

$$P\{S_1 > 0, \ldots, S_n > 0 \mid S_n = a - b\} = \frac{a - b}{n} = \frac{a - b}{a + b} \tag{6.3.4}$$


REMARK. This problem is equivalent to the ballot problem: In an election 
with two candidates, candidate 1 receives a votes and candidate 2 
receives b votes, with a > b, a + b = n. The ballots are shuffled and 
counted one by one. The probability that candidate 1 will lead 
throughout the balloting is $(a - b)/(a + b)$. [Each possible sequence
of ballots corresponds to a path from (0, 0) to (n, a — b); a sequence 
in which candidate 1 is always ahead corresponds to a path from 
(0, 0) to (n, a — b) that is always above the axis.] 
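The ballot formula (6.3.4) can be confirmed for small cases by listing every ordering of the ballots. The sketch below is a brute-force illustration only; the function name `ballot_probability` is an assumption, and the enumeration is practical only for small $a + b$.

```python
from fractions import Fraction
from itertools import combinations

def ballot_probability(a, b):
    """Fraction of orderings of a (+1)'s and b (-1)'s whose partial sums stay positive."""
    n = a + b
    favorable = total = 0
    for success_positions in combinations(range(n), a):
        chosen = set(success_positions)
        position, always_positive = 0, True
        for step in range(n):
            position += 1 if step in chosen else -1
            if position <= 0:
                always_positive = False
                break
        favorable += always_positive
        total += 1
    return Fraction(favorable, total)

print(ballot_probability(4, 2), Fraction(4 - 2, 4 + 2))
```

Each ordering is equally likely once the numbers of votes are fixed, so the ratio of counts is the conditional probability in (6.3.1).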


We now compute

$$h_j = P\{\text{the first return to 0 occurs at time } j\}$$

Since $h_j$ must be 0 for $j$ odd, we may as well set $j = 2n$. Now

$$h_{2n} = P\{S_1 \ne 0, \ldots, S_{2n-1} \ne 0, S_{2n} = 0\}$$

and thus $h_{2n}$ is the number of paths from (0, 0) to $(2n, 0)$ lying above the axis,
times 2 (to take into account paths lying below the axis), times $(pq)^n$, the
probability of each path (see Figure 6.3.3).

FIGURE 6.3.3 A First Return to 0 at Time $2n$.

The number of paths from (0, 0) to $(2n, 0)$ lying above the axis is the number
from (0, 0) to $(2n - 1, 1)$ lying above the axis, which, by (6.3.3), is
$\binom{2n-1}{a}(a - b)/(2n - 1)$ (where $a + b = 2n - 1$, $a - b = 1$, hence $a = n$,
$b = n - 1$), that is,

$$\binom{2n-1}{n} \frac{1}{2n - 1} = \frac{(2n - 2)!}{n!\,(n - 1)!} = \frac{1}{n} \binom{2n-2}{n-1}$$

Thus

$$h_{2n} = \frac{2}{n} \binom{2n-2}{n-1} (pq)^n \tag{6.3.5}$$

FIGURE 6.3.4 Computation of First Passage Times [a path reaching the point $(y + 2k, y)$].
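The path count behind (6.3.5) can be checked by exhaustive enumeration for small $n$. This is an illustrative sketch; the helper name `first_return_paths` is an assumption, and the loop visits all $2^{2n}$ paths, so it is only meant for small $n$.

```python
from itertools import product
from math import comb

def first_return_paths(n):
    """Count +-1 paths of length 2n that return to 0 for the first time at step 2n."""
    count = 0
    for steps in product((1, -1), repeat=2 * n):
        position, returned_early = 0, False
        for step in steps[:-1]:
            position += step
            if position == 0:
                returned_early = True
                break
        if not returned_early and position + steps[-1] == 0:
            count += 1
    return count

for n in range(1, 6):
    print(n, first_return_paths(n), 2 * comb(2 * n - 2, n - 1) // n)
```

Multiplying either count by $(pq)^n$ gives $h_{2n}$, since every such path has $n$ successes and $n$ failures.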


We now compute probabilities of first passage times, that is, $P$\{the first
passage through $y > 0$ takes place at time $r$\}. The only possible values of $r$
are of the form $y + 2k$, $k = 0, 1, \ldots$; hence we are looking for

$$h_{y+2k}^{(y)} = P\{\text{the first passage through } y > 0 \text{ occurs at time } y + 2k\}$$

To do the computation in an effortless manner, see Figure 6.3.4. If we look
at the path of Figure 6.3.4 backward, it always lies below $y$ and travels a
vertical distance $y$ in time $y + 2k$. Thus the number of favorable paths is the
number of paths from (0, 0) to $(y + 2k, y)$ that lie above the axis; that is,
by (6.3.3),

$$\frac{a - b}{a + b} \binom{a+b}{a} = \frac{y}{y + 2k} \binom{y+2k}{a}, \quad \text{where } a + b = y + 2k,\; a - b = y$$

Thus $a = y + k$, $b = k$. Consequently

$$h_{y+2k}^{(y)} = \frac{y}{y + 2k} \binom{y+2k}{k} p^{y+k} q^k \tag{6.3.6}$$

PROBLEMS
PROBLEMS 


The first five problems refer to the simple random walk starting at 0, with no
barriers.

1. Show that $P\{S_1 \ge 0, S_2 \ge 0, \ldots, S_{2n-1} \ge 0, S_{2n} = 0\} = u_{2n}/(n + 1)$, where
$u_{2n} = P\{S_{2n} = 0\}$ is the probability of $n$ successes (and $n$ failures) in $2n$ Bernoulli
trials, that is, $\binom{2n}{n}(pq)^n$.
2. Let $p = q = 1/2$.
(a) Show that $h_{2n} = u_{2n-2}/2n$, where $u_{2n} = \binom{2n}{n}(1/2)^{2n}$.
(b) Show that $u_{2n}/u_{2n-2} = 1 - 1/2n$; hence $h_{2n} = u_{2n-2} - u_{2n}$.

3. If $p = q = 1/2$, show that $P\{S_1 \ne 0, \ldots, S_{2n} \ne 0\}$, the probability of no return
to the origin in the first $2n$ steps, is $u_{2n} = \binom{2n}{n} 2^{-2n}$. Show also that
$P\{S_1 \ne 0, \ldots, S_{2n-1} \ne 0\} = h_{2n} + u_{2n}$.

4. If $p = q = 1/2$, show that $P\{S_1 \ge 0, \ldots, S_{2n} \ge 0\} = \binom{2n}{n} 2^{-2n}$.

5. If $p = q = 1/2$, show that the average length of time required to return to the
origin is infinite, by using Stirling's formula to find the asymptotic expression
for $h_{2n}$ and then showing that $\sum_{n=1}^{\infty} n h_{2n} = \infty$.

6. Two players each toss an unbiased coin independently $n$ times. Show that the
probability that each player will have the same number of heads after $n$ tosses
is $(1/2)^{2n} \sum_{k=0}^{n} \binom{n}{k}^2$.

7. By looking at Problem 6 in a slightly different way, show that $\sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}$.

8. A spider and a fly are situated at the corners of an $n$ by $n$ grid, as shown in
Figure P.6.3.8. The spider walks only north or east, the fly only south or west;
they take their steps simultaneously, to an adjacent vertex of the grid.

FIGURE P.6.3.8 [grid with the spider and the fly at opposite corners and the diagonal $D$]

(a) Show that if they meet, the point of contact must be on the diagonal $D$.

(b) Show that if the successive steps are independent, and equally likely to go
in each of the two possible directions, the probability that they will meet is
$\binom{2n}{n}(1/2)^{2n}$.


6.4 GENERATING FUNCTIONS 


Let $\{a_n, n \ge 0\}$ be a bounded sequence of real numbers. The generating
function of the sequence is defined by

$$A(z) = \sum_{n=0}^{\infty} a_n z^n, \quad z \text{ complex}$$

The series converges at least for $|z| < 1$. If $R$ is a discrete random variable
taking on only nonnegative integer values, and $P\{R = n\} = a_n$, $n = 0, 1, \ldots$,
then $A(z)$ is called the generating function of $R$. Note that
$A(z) = \sum_{n=0}^{\infty} z^n P\{R = n\} = E(z^R)$, the characteristic function of $R$ with $z$ replacing
$e^{-iu}$.

We have seen that the characteristic function of a sum of independent
random variables is the product of the characteristic functions. An analogous
result holds for generating functions.


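Theorem 1. Let $\{a_n\}$ and $\{b_n\}$ be bounded sequences of real numbers. Let
$\{c_n\}$ be the convolution of $\{a_n\}$ and $\{b_n\}$, defined by

$$c_n = \sum_{k=0}^{n} a_k b_{n-k} \quad \Big(= \sum_{j=0}^{n} b_j a_{n-j}\Big)$$

Then $C(z) = \sum_{n=0}^{\infty} c_n z^n$ is convergent at least for $|z| < 1$, and

$$C(z) = A(z)B(z)$$

PROOF. Suppose first that $a_n = P\{R_1 = n\}$, $b_n = P\{R_2 = n\}$, where $R_1$
and $R_2$ are independent nonnegative integer-valued random variables. Then
$c_n = P\{R_1 + R_2 = n\}$, since $\{R_1 + R_2 = n\}$ is the disjoint union of the
events $\{R_1 = k, R_2 = n - k\}$, $k = 0, 1, \ldots, n$. Thus

$$C(z) = E(z^{R_1 + R_2}) = E(z^{R_1} z^{R_2}) = E(z^{R_1})E(z^{R_2}) = A(z)B(z)$$

In general,

$$\sum_{n=0}^{\infty} c_n z^n = \sum_{n=0}^{\infty} \sum_{k=0}^{n} a_k b_{n-k} z^n = \sum_{k=0}^{\infty} a_k z^k \Big(\sum_{n=k}^{\infty} b_{n-k} z^{n-k}\Big) = A(z)B(z)$$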
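Theorem 1 is easy to check numerically for sequences with finitely many nonzero terms. The sketch below uses two hypothetical probability functions and assumed helper names `convolve` and `evaluate`; it compares the generating function of the convolution with the product of the two generating functions at one point.

```python
def convolve(a, b):
    """c_n = sum_k a_k b_{n-k} for finite coefficient lists a and b."""
    c = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            c[i + j] += a_i * b_j
    return c

def evaluate(coeffs, z):
    """Value of the generating function sum_n coeffs[n] * z**n."""
    return sum(c_n * z**n for n, c_n in enumerate(coeffs))

a = [0.2, 0.5, 0.3]   # hypothetical P{R_1 = n}, n = 0, 1, 2
b = [0.6, 0.4]        # hypothetical P{R_2 = n}, n = 0, 1
c = convolve(a, b)    # P{R_1 + R_2 = n}, n = 0, ..., 3
z = 0.7
print(c, evaluate(c, z), evaluate(a, z) * evaluate(b, z))
```

The last two printed numbers should agree up to rounding, which is the identity $C(z) = A(z)B(z)$ at $z = 0.7$.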


We have seen that under appropriate conditions the moments of a random
variable can be obtained from its characteristic function. Similar results hold
for generating functions. Let $A(z)$ be the generating function of the random
variable $R$; restrict $z$ to be real and between 0 and 1. We show that

$$E(R) = A'(1) \tag{6.4.1}$$

where

$$A'(1) = \lim_{z \to 1} A'(z)$$

If $E(R)$ is finite, then the variance of $R$ is given by

$$\operatorname{Var} R = A''(1) + A'(1) - [A'(1)]^2 \tag{6.4.2}$$

To establish (6.4.1) and (6.4.2), notice that

$$A(z) = \sum_{n=0}^{\infty} a_n z^n, \quad a_n = P\{R = n\}$$

Thus

$$A'(z) = \sum_{n=1}^{\infty} n a_n z^{n-1}$$

Let $z \to 1$ to obtain

$$A'(1) = \sum_{n=1}^{\infty} n a_n = E(R)$$

proving (6.4.1). Similarly,

$$A''(z) = \sum_{n=2}^{\infty} n(n - 1) a_n z^{n-2}, \quad \text{so } A''(1) = E(R^2) - E(R)$$

Therefore

$$\operatorname{Var} R = E(R^2) - [E(R)]^2 = A''(1) + A'(1) - [A'(1)]^2$$

which is (6.4.2).
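Formulas (6.4.1) and (6.4.2) can be illustrated with a distribution whose mean and variance are known. The sketch below assumes a binomial probability function with hypothetical parameters `m` and `p`; the helper name `moments_from_pmf` is also an assumption.

```python
from math import comb

def moments_from_pmf(pmf):
    """Return (E(R), Var R) using A'(1) = sum n a_n and A''(1) = sum n(n-1) a_n."""
    d1 = sum(n * a_n for n, a_n in enumerate(pmf))            # A'(1)
    d2 = sum(n * (n - 1) * a_n for n, a_n in enumerate(pmf))  # A''(1)
    return d1, d2 + d1 - d1**2

m, p = 10, 0.3  # hypothetical binomial parameters
pmf = [comb(m, n) * p**n * (1 - p) ** (m - n) for n in range(m + 1)]
mean, variance = moments_from_pmf(pmf)
print(mean, variance)
```

The output should be close to $mp = 3$ and $mp(1 - p) = 2.1$, the usual binomial mean and variance.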

Now consider the simple random walk starting at 0, with no barriers. Let
$u_n = P\{S_n = 0\}$, $h_n$ = the probability that the first return to 0 will occur at
time $n$ $= P\{S_1 \ne 0, \ldots, S_{n-1} \ne 0, S_n = 0\}$. Let

$$U(z) = \sum_{n=0}^{\infty} u_n z^n, \qquad H(z) = \sum_{n=0}^{\infty} h_n z^n$$

(For the remainder of this section, $z$ is restricted to real values.)

If we are at the origin at time $n$, the first return to 0 must occur at some time
$k$, $k = 1, 2, \ldots, n$. If the first return to 0 occurs at time $k$, we must be at the
origin after $n - k$ additional steps. Since the events {first return to 0 at
time $k$}, $k = 1, 2, \ldots, n$, are disjoint, we have

$$u_n = \sum_{k=1}^{n} h_k u_{n-k}, \quad n = 1, 2, \ldots$$

Let us write this as

$$u_n = \sum_{k=0}^{n} h_k u_{n-k}, \quad n = 1, 2, \ldots$$

This will be valid provided that we define $h_0 = 0$. Now $u_0 = 1$, since the
walk starts at the origin, but $h_0 u_0 = 0$. Thus we may write

$$v_n = \sum_{k=0}^{n} h_k u_{n-k}, \quad n = 0, 1, \ldots \tag{6.4.3}$$

where $v_n = u_n$, $n \ge 1$; $v_0 = 0 = u_0 - 1$.
Since $\{v_n\}$ is the convolution of $\{h_n\}$ and $\{u_n\}$, Theorem 1 yields

$$V(z) = H(z)U(z)$$

But

$$V(z) = \sum_{n=0}^{\infty} v_n z^n = \sum_{n=1}^{\infty} u_n z^n = U(z) - 1$$
Thus

$$U(z)(1 - H(z)) = 1 \tag{6.4.4}$$

We may use (6.4.4) to find the $h_n$ explicitly. For

$$U(z) = \sum_{n=0}^{\infty} u_n z^n = \sum_{n=0}^{\infty} u_{2n} z^{2n} = \sum_{n=0}^{\infty} \binom{2n}{n} (pq)^n z^{2n}$$

This can be put into a closed form, as follows. We claim that

$$\binom{2n}{n} = \binom{-1/2}{n} (-4)^n \tag{6.4.5}$$

where, for any real number $\alpha$, $\binom{\alpha}{n}$ denotes

$$\frac{\alpha(\alpha - 1) \cdots (\alpha - n + 1)}{n!}$$

To see this, write

$$\binom{-1/2}{n} = \frac{(-1/2)(-3/2) \cdots [-(2n - 1)/2]}{n!} = \frac{(-1)^n\, 1 \cdot 3 \cdot 5 \cdots (2n - 1)}{n!\, 2^n} \cdot \frac{(2/2)(4/2)(6/2) \cdots (2n/2)}{n!} = \frac{(2n)!\,(-1)^n}{n!\, n!\, 4^n}$$

proving (6.4.5). Thus

$$U(z) = \sum_{n=0}^{\infty} \binom{-1/2}{n} (-4pqz^2)^n = (1 - 4pqz^2)^{-1/2}$$

by the binomial theorem. By (6.4.4) we have

$$H(z) = 1 - \frac{1}{U(z)} = 1 - (1 - 4pqz^2)^{1/2} \tag{6.4.6}$$

This may be expanded by the binomial theorem to obtain the $h_n$ (see Problem
1); of course the results will agree with (6.3.5), obtained by the combinatorial
approach of the preceding section. Notice that we must have the positive
square root in (6.4.6), since $H(0) = h_0 = 0$.
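The agreement between the generating function approach and (6.3.5) can be checked without expanding (6.4.6). The sketch below solves the convolution relation $u_n = \sum_{k=1}^{n} h_k u_{n-k}$ for the $h_n$ in exact rational arithmetic and compares the result with the combinatorial formula. The function name `first_return_probs` and the value $p = 2/5$ are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb

def first_return_probs(p, n_max):
    """Solve u_n = sum_{k=1}^{n} h_k u_{n-k} for h_1, ..., h_{n_max}."""
    q = 1 - p
    u = [Fraction(0)] * (n_max + 1)
    for n in range(0, n_max + 1, 2):           # u_n = 0 for odd n
        u[n] = comb(n, n // 2) * (p * q) ** (n // 2)
    h = [Fraction(0)] * (n_max + 1)            # h_0 = 0
    for n in range(1, n_max + 1):
        # u_0 = 1, so h_n = u_n - sum_{k=1}^{n-1} h_k u_{n-k}
        h[n] = u[n] - sum(h[k] * u[n - k] for k in range(1, n))
    return h

p = Fraction(2, 5)
h = first_return_probs(p, 12)
formula = [Fraction(2, n) * comb(2 * n - 2, n - 1) * (p * (1 - p)) ** n for n in range(1, 7)]
print([h[2 * n] for n in range(1, 7)] == formula)
```

Because the recursion determines the $h_n$ uniquely from the $u_n$, equality with the list `formula` confirms (6.3.5) for the first six even times.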


Some useful information may be gathered without expanding $H(z)$.
Observe that $H(1) = \sum_{n=1}^{\infty} h_n$ is the probability of eventual return to 0.
By (6.4.6),

$$H(1) = 1 - (1 - 4pq)^{1/2} = 1 - |p - q| \quad \text{by (6.2.2)}$$

This agrees with the result (6.2.7) obtained previously.

Now assume $p = q = 1/2$, so that there is a return to 0 with probability 1.
We show that the average length of time required to return to the origin is
infinite. For if $T$ is the time of first return to 0, then

$$E(T) = \sum_{n=1}^{\infty} n P\{T = n\} = \sum_{n=1}^{\infty} n h_n = H'(1)$$

as in (6.4.1). By (6.4.6),

$$H'(z) = -\frac{d}{dz}(1 - z^2)^{1/2} = z(1 - z^2)^{-1/2} \to \infty \quad \text{as } z \to 1$$

Thus $E(T) = \infty$, as asserted (see Problem 5, Section 6.3, for another
approach).


PROBLEMS 


1. Expand (6.4.6) by the binomial theorem to obtain the $h_n$.

2. Solve the difference equation $a_{n+1} - 3a_n = 4$ by taking the generating function
of both sides to obtain

$$A(z) = \frac{4z}{(1 - z)(1 - 3z)} + \frac{a_0}{1 - 3z}$$

Expand in partial fractions and use a geometric series expansion to find $a_n$.

3. Let $A(z)$ be the generating function of the sequence $\{a_n\}$; assume that
$\sum_{n=1}^{\infty} |a_n - a_{n-1}| < \infty$. Show that if $\lim_{n \to \infty} a_n$ exists, the limit is

$$\lim_{z \to 1} (1 - z)A(z)$$

4. If $R$ is a random variable with generating function $A(z)$, find the generating
function of $R + k$ and $kR$, where $k$ is a nonnegative integer. If $F(n) = P\{R \le n\}$,
find the generating function of $\{F(n)\}$.

5. Let $R_1, R_2, \ldots$ be independent random variables, with $P\{R_i = 1\} = p$,
$P\{R_i = 0\} = q = 1 - p$, $i = 1, 2, \ldots$. Thus we have an infinite sequence of Bernoulli
trials; $R_i = 1$ corresponds to a success on trial $i$, and $R_i = 0$ is a failure. (Assume
$0 < p < 1$.) Let $R$ be the number of trials required to obtain the first success.
(a) Show that $P\{R = k\} = q^{k-1}p$, $k = 1, 2, \ldots$.

(b) Use generalized characteristic functions to show that $E(R) = 1/p$, $\operatorname{Var} R =
(1 - p)/p^2$; check the result by calculating the generating function of $R$
and using (6.4.1) and (6.4.2). $R$ is said to have the geometric distribution.

6. With the $R_i$ as in Problem 5, let $N_r$ be the number of trials required to obtain
the $r$th success ($r = 1, 2, \ldots$).
(a) Show that

$$P\{N_r = k\} = \binom{k-1}{r-1} p^r q^{k-r} = \binom{-r}{k-r} p^r (-q)^{k-r}, \quad k = r, r + 1, \ldots$$

where $\binom{-r}{j}$ is defined as $(-r)(-r - 1) \cdots (-r - j + 1)/j!$, $j = 1, 2, \ldots$,
$\binom{-r}{0} = 1$.

(b) Let $T_1$ = the number of trials required to obtain the first success, $T_2$ = the
number of trials following the first success up to and including the second
success, ..., $T_r$ = the number of trials following the $(r - 1)$st success up
to and including the $r$th success (thus $N_r = T_1 + \cdots + T_r$). Show that the
$T_i$ are independent, each with the geometric distribution.

(c) Show that $E(N_r) = r/p$, $\operatorname{Var} N_r = r(1 - p)/p^2$. Find the characteristic
function and the generating function of $N_r$. $N_r$ is said to have the negative
binomial distribution.

7. With the $R_i$ as in Problem 5, let $R$ be the length of the run (of either successes
or failures) started by the first trial. Find $P\{R = k\}$, $k = 1, 2, \ldots$, and $E(R)$.

8. In Problem 6, find the joint probability function of $N_1$ and $N_2$; also find (in a
relatively effortless manner) the correlation coefficient between $N_1$ and $N_2$.


6.5 THE POISSON RANDOM PROCESS 


We now consider a mathematical model that fits a wide variety of physical
phenomena. Let $T_1, T_2, \ldots$ be a sequence of independent random variables,
where each $T_i$ is absolutely continuous with density $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$;
$f(x) = 0$, $x < 0$ ($\lambda$ is a fixed positive constant). Let $A_n = T_1 + \cdots + T_n$,
$n = 1, 2, \ldots$. We may think of $A_n$ as the arrival time of the $n$th customer at a
serving counter, so that $T_n$ is the waiting time between the arrival of the
$(n - 1)$st customer and the arrival of the $n$th customer. Equally well, $A_n$
may be regarded as the time at which the $n$th call is made at a telephone
exchange, the time at which the $n$th component fails on an assembly line, or
the time at which the $n$th electron arrives at the anode of a vacuum tube.

If $t \ge 0$, let $R_t$ be the number of customers that have arrived up to and
including time $t$; that is, $R_t = n$ if $A_n \le t < A_{n+1}$ ($n = 0, 1, \ldots$; define
$A_0 = 0$). A sketch of ($R_t$, $t \ge 0$) is given in Figure 6.5.1.

Thus we have a family of random variables $R_t$, $t \ge 0$ (not just a sequence,
but instead a random variable defined for each nonnegative real number).
A family of random variables $R_t$, where $t$ ranges over an arbitrary set $I$, is
called a random process or stochastic process. Note that if $I$ is the set of positive
integers, the random process becomes a sequence of random variables; if
$I$ is a finite set, we obtain a finite collection of random variables; and if $I$
consists of only one element, we obtain a single random variable. Thus the
concept of a random process includes all situations studied previously.
FIGURE 6.5.1 Poisson Process [$R_t$ versus $t$, a step function rising by 1 at each arrival time $A_1, A_2, A_3, A_4, \ldots$; $T_n = A_n - A_{n-1}$].


If the outcome of the experiment is $\omega$, we may regard ($R_t(\omega)$, $t \in I$) as a
real-valued function defined on $I$. In Figure 6.5.1 what is actually sketched is
$R_t(\omega)$ versus $t$, $t \in I$ = the nonnegative reals, for a particular $\omega$. Thus,
roughly speaking, we have a "random function," that is, a function that
depends on the outcome of a random experiment.

The particular process introduced above is called the Poisson process
since, for each $t > 0$, $R_t$ has the Poisson distribution with parameter $\lambda t$.
Let us verify this.

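If $k$ is a nonnegative integer,

$$P\{R_t \le k\} = P\{\text{at most } k \text{ customers have arrived by time } t\}$$
$$= P\{(k + 1)\text{st customer arrives after time } t\}$$
$$= P\{T_1 + \cdots + T_{k+1} > t\}$$

But $A_{k+1} = T_1 + \cdots + T_{k+1}$ is the sum of $k + 1$ independent random
variables, each with generalized characteristic function $\int_0^{\infty} \lambda e^{-\lambda x} e^{-sx}\, dx =
\lambda/(s + \lambda)$, Re $s > -\lambda$; hence $A_{k+1}$ has the generalized characteristic function
$[\lambda/(s + \lambda)]^{k+1}$, Re $s > -\lambda$. Thus (see Table 5.1.1) the density of $A_{k+1}$ is

$$f_{A_{k+1}}(x) = \frac{1}{k!} \lambda^{k+1} x^k e^{-\lambda x} u(x) \tag{6.5.1}$$

where $u$ is the unit step function. Thus

$$P\{R_t \le k\} = \int_t^{\infty} \frac{1}{k!} \lambda^{k+1} x^k e^{-\lambda x}\, dx$$
$$= \int_t^{\infty} \frac{1}{k!} \lambda^{k+1} x^k\, d\Big(\frac{-e^{-\lambda x}}{\lambda}\Big)$$
$$= \frac{(\lambda t)^k e^{-\lambda t}}{k!} + \int_t^{\infty} \frac{1}{(k - 1)!} \lambda^k x^{k-1} e^{-\lambda x}\, dx \quad \text{(integrate by parts)}$$

Integrating by parts successively, we obtain

$$P\{R_t \le k\} = \sum_{j=0}^{k} e^{-\lambda t} \frac{(\lambda t)^j}{j!}$$

Hence $R_t$ has the Poisson distribution with parameter $\lambda t$.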
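The identity between the Poisson sum and the tail of the arrival-time density (6.5.1) can be checked numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the integral is truncated at an arbitrary upper limit where the integrand is negligible for the chosen parameters, and Simpson's rule stands in for the exact integration by parts.

```python
from math import exp, factorial

def poisson_cdf(k, lam, t):
    """P{R_t <= k} as the Poisson sum with parameter lam * t."""
    return sum(exp(-lam * t) * (lam * t) ** j / factorial(j) for j in range(k + 1))

def arrival_tail(k, lam, t, upper=60.0, steps=60000):
    """P{A_{k+1} > t}: Simpson's rule on the density (6.5.1) over [t, upper]."""
    def density(x):
        return lam ** (k + 1) * x**k * exp(-lam * x) / factorial(k)
    width = (upper - t) / steps
    total = density(t) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(t + i * width)
    return total * width / 3

print(poisson_cdf(3, 2.0, 1.5), arrival_tail(3, 2.0, 1.5))
```

With $\lambda = 2$, $t = 1.5$, and $k = 3$, both quantities should be close to $13e^{-3}$.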

Now the mean of a Poisson random variable is its parameter, so that
$E(R_t) = \lambda t$. Thus $\lambda$ may be interpreted as the average number of customers
arriving per second. We should expect that $\lambda^{-1}$ is the average number of
seconds per customer, that is, the average waiting time between customers.
This may be verified by computing $E(T_i)$.

$$E(T_i) = \int_0^{\infty} x \lambda e^{-\lambda x}\, dx = \frac{1}{\lambda}$$


We now establish an important feature of the Poisson process. Intuitively,
if we arrive at the serving counter at time $t$ and the last customer to arrive
came at time $t - h$, the distribution function of the length of time we
must wait for the arrival of the next customer does not depend on $h$, and
in fact coincides with the distribution function of $T_1$. Thus we are essentially
starting from scratch at time $t$; the process does not remember that $h$
seconds have elapsed between the arrival of the last customer and the present
time.

If $W_t$ is the waiting time from $t$ to the arrival of the next customer, we wish
to show that $P\{W_t \le z\} = P\{T_1 \le z\}$, $z \ge 0$. We have


$$P\{W_t \le z\} = P\{\text{for some } n = 1, 2, \ldots, \text{ the } n\text{th customer arrives in } (t, t + z] \text{ and the } (n + 1)\text{st customer arrives after time } t + z\}$$
$$= P\Big[\bigcup_{n=1}^{\infty} \{t < A_n \le t + z < A_{n+1}\}\Big] \tag{6.5.2}$$

(see Figure 6.5.2). To justify (6.5.2), notice that if $t < A_n \le t + z < A_{n+1}$
for some $n$, then $W_t \le z$. Conversely, if $W_t \le z$, then some customer arrives
in $(t, t + z]$ and hence there will be a last customer to arrive in the interval.
(If not, then $\sum_{n=1}^{\infty} T_n < \infty$; but this event has probability 0; see Problem 1.)
Now $P\{t < A_n \le t + z < A_{n+1}\} = P\{t < A_n \le t + z, A_n + T_{n+1} > t + z\}$.

FIGURE 6.5.2 [time axis showing $t < A_n \le t + z < A_{n+1}$]
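Since $A_n$ ($= T_1 + \cdots + T_n$) and $T_{n+1}$ are independent, we obtain, by (6.5.1),

$$P\{t < A_n \le t + z < A_{n+1}\} = \iint_{\substack{t < x \le t + z \\ x + y > t + z}} \frac{1}{(n - 1)!} \lambda^n x^{n-1} e^{-\lambda x} \lambda e^{-\lambda y}\, dx\, dy$$
$$= \frac{\lambda^n}{(n - 1)!} \int_t^{t+z} x^{n-1} e^{-\lambda x} \int_{t+z-x}^{\infty} \lambda e^{-\lambda y}\, dy\, dx$$
$$= \frac{\lambda^n}{(n - 1)!} \int_t^{t+z} x^{n-1} e^{-\lambda x} e^{-\lambda(t + z - x)}\, dx$$
$$= \frac{\lambda^n e^{-\lambda(t+z)}}{n!} \left[(t + z)^n - t^n\right]$$

Since $\sum_{n=0}^{\infty} r^n/n! = e^r$, (6.5.2) yields

$$P\{W_t \le z\} = e^{-\lambda(t+z)} \left[(e^{\lambda(t+z)} - 1) - (e^{\lambda t} - 1)\right] = 1 - e^{-\lambda z}$$

Thus $W_t$ has the same distribution function as $T_1$.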
Alternatively, we may write

$$P\{W_t \le z\} = P\Big[\bigcup_{n=0}^{\infty} \{A_n \le t < A_{n+1} \le t + z\}\Big]$$

For if $A_n \le t < A_{n+1} \le t + z$ for some $n$, then $W_t \le z$. Conversely, if
$W_t \le z$, then some customer arrives in $(t, t + z]$, and there must be a first
customer to arrive in this interval, say customer $n + 1$. Thus
$A_n \le t < A_{n+1} \le t + z$ for some $n \ge 0$. An argument very similar to the above shows
that

$$P\{A_n \le t < A_{n+1} \le t + z\} = \frac{(\lambda t)^n}{n!} \left(e^{-\lambda t} - e^{-\lambda(t+z)}\right)$$

and therefore $P\{W_t \le z\} = 1 - e^{-\lambda z}$ as before. In this approach, we do
not have the problem of showing that $P\{\sum_{n=1}^{\infty} T_n < \infty\} = 0$.
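The summation in the alternative argument is simple to verify numerically. The following sketch adds the terms $\frac{(\lambda t)^n}{n!}(e^{-\lambda t} - e^{-\lambda(t+z)})$ over $n$ and compares the total with $1 - e^{-\lambda z}$. The function name, the truncation at 80 terms, and the parameter values are assumptions made for illustration.

```python
from math import exp, factorial

def waiting_cdf_by_series(lam, t, z, terms=80):
    """Sum over n of P{A_n <= t < A_{n+1} <= t + z}, truncated after `terms` terms."""
    gap = exp(-lam * t) - exp(-lam * (t + z))
    return sum((lam * t) ** n / factorial(n) * gap for n in range(terms))

lam, t, z = 1.5, 4.0, 0.8
print(waiting_cdf_by_series(lam, t, z), 1 - exp(-lam * z))
```

The result does not depend on $t$, which is the "starting from scratch" property in numerical form.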


To justify completely the statement that the process starts from scratch
at time $t$, we may show that if $V_1, V_2, \ldots$ are the successive waiting times
starting at $t$ (so $V_1 = W_t$), then $V_1, V_2, \ldots$ are independent, and $V_i$ and
$T_i$ have the same distribution function for all $i$. To see this, observe that

$$P\{V_1 \le x_1, \ldots, V_k \le x_k\} = P\Big[\bigcup_{n=0}^{\infty} \{A_n \le t < A_{n+1} \le t + x_1, T_{n+2} \le x_2, \ldots, T_{n+k} \le x_k\}\Big]$$

For it is clear that the set on the right is a subset of the set on the left.
Conversely, if $V_1 \le x_1, \ldots, V_k \le x_k$, then a customer arrives in $(t, t + x_1]$,
and hence there is a first customer in this interval, say customer $n + 1$.
Then $A_n \le t < A_{n+1} \le t + x_1$, and also $V_i = T_{n+i}$, $i = 2, \ldots, k$, as
desired. Therefore

$$P\{V_1 \le x_1, \ldots, V_k \le x_k\} = \sum_{n=0}^{\infty} P\{A_n \le t < A_{n+1} \le t + x_1\} \prod_{i=2}^{k} P\{T_{n+i} \le x_i\}$$
$$= P\{W_t \le x_1\} \prod_{i=2}^{k} P\{T_i \le x_i\}$$
$$= \prod_{i=1}^{k} P\{T_i \le x_i\}$$

Fix $j$ and let $x_i \to \infty$, $i \ne j$, to conclude that $P\{V_j \le x_j\} = P\{T_j \le x_j\}$.
Consequently

$$P\{V_1 \le x_1, \ldots, V_k \le x_k\} = \prod_{i=1}^{k} P\{V_i \le x_i\}$$

and the result follows. In particular, the number of customers arriving in
the interval $(t, t + \tau]$ has the Poisson distribution with parameter $\lambda\tau$.

The "memoryless" feature of the Poisson process is connected with a basic
property of the exponential density, as the following intuitive argument
shows. Suppose that we start counting at time $t$, and the last customer to
arrive, say customer $n - 1$, came at time $t - h$. Then (see Figure 6.5.3) the
probability that $W_t \le z$ is the probability that $T_n \le z + h$, given that we
have already waited $h$ seconds; that is, given that $T_n > h$. Thus

$$
\begin{aligned}
P\{W_t \le z\} &= P\{T_n \le z + h \mid T_n > h\} \\
&= \frac{P\{h < T_n \le z + h\}}{P\{T_n > h\}} \\
&= \frac{\int_h^{z+h} \lambda e^{-\lambda x}\,dx}{\int_h^{\infty} \lambda e^{-\lambda x}\,dx} \\
&= \frac{e^{-\lambda h} - e^{-\lambda(z+h)}}{e^{-\lambda h}} = 1 - e^{-\lambda z} = P\{T_n \le z\}
\end{aligned}
$$

so that $W_t$ and $T_n$ have the same distribution function (for any $n$).
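The conditional-probability computation above can be mirrored directly in code. This is a minimal sketch with hypothetical function names; it works with the tail probability $P\{T > x\} = e^{-\lambda x}$ so that the ratio stays numerically stable for large $h$.

```python
import math

def survival(lam, x):
    """P{T > x} = e^{-lam x} for an exponential waiting time, x >= 0."""
    return math.exp(-lam * x)

def conditional_cdf(lam, z, h):
    """P{T <= z + h | T > h} = P{h < T <= z + h} / P{T > h}."""
    return (survival(lam, h) - survival(lam, z + h)) / survival(lam, h)

def unconditional_cdf(lam, z):
    """P{T <= z} = 1 - e^{-lam z}."""
    return 1.0 - survival(lam, z)
```

Whatever time $h$ has already elapsed, `conditional_cdf(lam, z, h)` agrees with `unconditional_cdf(lam, z)` up to rounding.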
FIGURE 6.5.3


The key property of the exponential density used in this argument is

$$
P\{T_n \le z + h \mid T_n > h\} = P\{T_n \le z\}
$$

that is,

$$
P\{T_n > z + h \mid T_n > h\} = P\{T_n > z\}
$$

or (since $\{T_n > h,\ T_n > z + h\} = \{T_n > z + h\}$)

$$
P\{T_n > z + h\} = P\{T_n > z\}\,P\{T_n > h\}
$$

In fact, a positive random variable $T$ having the property that

$$
P\{T > z + h\} = P\{T > z\}\,P\{T > h\} \quad \text{for all } z, h \ge 0
$$

must have exponential density. For if $G(z) = P\{T > z\}$, then

$$
G(z + h) = G(z)G(h)
$$

Therefore

$$
\frac{G(z+h) - G(z)}{h} = G(z)\,\frac{G(h) - 1}{h} = G(z)\,\frac{G(h) - G(0)}{h} \quad \text{since } G(0) = 1
$$

Let $h \to 0$ to conclude that

$$
G'(z) = G'(0)G(z)
$$

This is a differential equation of the form

$$
\frac{dy}{dx} + \lambda y = 0 \quad (\lambda = -G'(0))
$$

whose solution is $y = ce^{-\lambda x}$; that is,

$$
G(z) = ce^{-\lambda z}
$$

But $G(0) = c = 1$; hence the distribution function of $T$ is

$$
F_T(z) = 1 - G(z) = 1 - e^{-\lambda z}, \quad z \ge 0
$$

and thus the density of $T$ is

$$
f_T(z) = \lambda e^{-\lambda z}, \quad z \ge 0
$$

as desired.

(The above argument assumes that T has a continuous density, but actually 
the result is true without this requirement.) 

The memoryless feature of the Poisson process may be used to show that
the process satisfies the Markov property:

If $0 \le t_1 < t_2 < \cdots < t_{n+1}$, and $a_1, \ldots, a_{n+1}$ are nonnegative integers
with $a_1 \le \cdots \le a_{n+1}$, then

$$
P\{R_{t_{n+1}} = a_{n+1} \mid R_{t_1} = a_1, \ldots, R_{t_n} = a_n\} = P\{R_{t_{n+1}} = a_{n+1} \mid R_{t_n} = a_n\}
$$

Thus the behavior of the process at the "future" time $t_{n+1}$, given the
behavior at "past" times $t_1, \ldots, t_{n-1}$ and the "present" time $t_n$, depends only
on the present state, or the number of customers at time $t_n$. For example,

$$
\begin{aligned}
P\{R_{t_4} = 15 \mid R_{t_1} = 0, R_{t_2} = 3, R_{t_3} = 8\} &= P\{R_{t_4} = 15 \mid R_{t_3} = 8\} \\
&= \text{the probability that exactly 7 customers will arrive between } t_3 \text{ and } t_4 \\
&= e^{-\lambda(t_4 - t_3)}\,\frac{[\lambda(t_4 - t_3)]^7}{7!}
\end{aligned}
$$


This result is reasonable in view of the memoryless feature, but a formal proof 
becomes quite cumbersome and will be omitted. 

We consider the Markov property in detail in the next chapter. 

We close this section by describing a physical approach to the Poisson
process. Suppose that we divide the interval $(t, t + \tau]$ into subintervals of
length $\Delta t$, and assume that the subintervals are so small that the probability
of the arrival of more than one customer in a given subinterval $I$ is negligible.
If the average number of customers arriving per second is $\lambda$, and $R_I$ is the
number of customers arriving in the subinterval $I$, then $E(R_I) = \lambda\,\Delta t$.
But $E(R_I) = (0)P\{R_I = 0\} + (1)P\{R_I = 1\} = P\{R_I = 1\}$, so that with
probability $\lambda\,\Delta t$ a customer will arrive in $I$, and with probability $1 - \lambda\,\Delta t$
no customer will arrive.

We assume that if $I$ and $J$ are disjoint subintervals, $R_I$ and $R_J$ are independent.
Then we have a sequence of $n = \tau/\Delta t$ Bernoulli trials, with probability
$p = \lambda\,\Delta t$ of success on a given trial, and with $\Delta t$ very small. Thus
we expect that the number $N(t, t + \tau]$ of customers arriving in $(t, t + \tau]$
should have the Poisson distribution with parameter $\lambda\tau$. Furthermore, if
$W_t$ is the waiting time from $t$ to the arrival of the next customer, then
$P\{W_t > x\} = P\{N(t, t + x] = 0\} = e^{-\lambda x}$, so that $W_t$ has exponential
density.
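The passage from Bernoulli trials to the Poisson distribution can be illustrated numerically. The following sketch is an illustration under stated assumptions rather than part of the original argument: the function names, the cutoff `kmax`, and the sample parameters are hypothetical. It compares the binomial probabilities with $n = \tau/\Delta t$ trials and $p = \lambda\,\Delta t$ against the Poisson probabilities with parameter $\lambda\tau$.

```python
import math

def binomial_pmf(n, p, k):
    """Probability of exactly k successes in n Bernoulli trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(mu, k):
    """Poisson probability e^{-mu} mu^k / k!."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def max_gap(lam, tau, n, kmax=20):
    """Largest pointwise gap between the two distributions for k = 0..kmax."""
    p = lam * tau / n  # success probability lam * dt, with dt = tau / n
    return max(
        abs(binomial_pmf(n, p, k) - poisson_pmf(lam * tau, k))
        for k in range(kmax + 1)
    )
```

As the number of subintervals grows, `max_gap` shrinks; the case $k = 0$ corresponds to $P\{W_t > \tau\} \approx e^{-\lambda\tau}$.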
PROBLEMS


1. Show that $P\{\sum_{n=1}^{\infty} T_n < \infty\} = 0$.


2. Show that the probability that an even number of customers will arrive in the
interval $(t, t + \tau]$ is $(1/2)(1 + e^{-2\lambda\tau})$ and the probability that an odd number of
customers will arrive in this interval is $(1/2)(1 - e^{-2\lambda\tau})$.


3. (The random telegraph signal) Let $T_1, T_2, \ldots$ be independent, each with density
$\lambda e^{-\lambda x}u(x)$. Define a random process $R_t$, $t \ge 0$, as follows.
$R_0 = +1$ or $-1$ with equal probability (assume $R_0$ independent of the $T_i$)

$$
\begin{aligned}
R_t &= R_0, & 0 \le t &< T_1 \\
R_t &= -R_0, & T_1 \le t &< T_1 + T_2 \\
R_t &= R_0, & T_1 + T_2 \le t &< T_1 + T_2 + T_3 \\
&\;\;\vdots \\
R_t &= (-1)^n R_0, & T_1 + \cdots + T_n \le t &< T_1 + \cdots + T_{n+1}
\end{aligned}
$$

(see Figure P.6.5.3).


FIGURE P.6.5.3


(a) Find the joint probability function of $R_t$ and $R_{t+\tau}$ ($t, \tau \ge 0$).
(b) Find the covariance function of the process, defined by $K(t, \tau) = \mathrm{Cov}(R_t,
R_{t+\tau})$, $t, \tau \ge 0$.
In this section we show that the arithmetic average $(R_1 + \cdots + R_n)/n$ of a
sequence of independent observations of a random variable $R$ converges
with probability 1 to $E(R)$ (assumed finite). In other words, we have

$$
\lim_{n \to \infty} \frac{R_1(\omega) + \cdots + R_n(\omega)}{n} = E(R)
$$

for all $\omega$, except possibly on a set of probability 0. We shall see that this is a
stronger convergence statement than the weak law of large numbers.

Let $(\Omega, \mathcal{F}, P)$ be a probability space, fixed throughout the discussion.
If $A_1, A_2, \ldots$ is a sequence of events, we define the upper limit or limit
superior of the sequence as

$$
\limsup_n A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k \qquad (6.6.1)
$$

and the lower limit or limit inferior of the sequence as

$$
\liminf_n A_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k \qquad (6.6.2)
$$


Theorem 1. $\limsup_n A_n = \{\omega: \omega \in A_n \text{ for infinitely many } n\}$, $\liminf_n A_n =
\{\omega: \omega \in A_n \text{ eventually, i.e., for all but finitely many } n\}$.


PROOF. By (6.6.1), $\omega \in \limsup_n A_n$ iff, for every $n$, $\omega \in \bigcup_{k=n}^{\infty} A_k$; that is,
for every $n$ there is a $k \ge n$ such that $\omega \in A_k$, or $\omega \in A_n$ for infinitely many $n$.
By (6.6.2), $\omega \in \liminf_n A_n$ iff, for some $n$, $\omega \in \bigcap_{k=n}^{\infty} A_k$; that is, for some $n$,
$\omega \in A_k$ for all $k \ge n$, or $\omega \in A_n$ eventually.


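Theorem 2. Let $R, R_1, R_2, \ldots$ be random variables on $(\Omega, \mathcal{F}, P)$.
Denote by $\{R_n \to R\}$ the set $\{\omega: \lim_{n \to \infty} R_n(\omega) = R(\omega)\}$. Then $\{R_n \to R\} =
\bigcap_{m=1}^{\infty} \liminf_n A_{nm}$, where $A_{nm} = \{\omega: |R_n(\omega) - R(\omega)| < 1/m\}$.


PROOF. $R_n(\omega) \to R(\omega)$ iff for every $m = 1, 2, \ldots$, $|R_n(\omega) - R(\omega)| < 1/m$
eventually, that is (Theorem 1), for every $m = 1, 2, \ldots$, $\omega \in \liminf_n A_{nm}$.


We say that the sequence $R_1, R_2, \ldots$ converges almost surely to $R$ (notation:
$R_n \xrightarrow{\text{a.s.}} R$) iff $P\{R_n \to R\} = 1$. The terminology "almost everywhere" is
also used.


Theorem 3. $R_n \xrightarrow{\text{a.s.}} R$ iff for every $\varepsilon > 0$, $P\{|R_k - R| \ge \varepsilon$ for at least
one $k \ge n\} \to 0$ as $n \to \infty$.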


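PROOF. By Theorem 2, $P\{R_n \to R\} \le P(\liminf_n A_{nm})$ for every $m$;
hence $R_n \xrightarrow{\text{a.s.}} R$ implies that $\liminf_n A_{nm}$ has probability 1 for every $m$.
Conversely, if $P(\liminf_n A_{nm}) = 1$ for all $m$, then, by Theorem 2, $\{R_n \to R\}$
is a countable intersection of sets with probability 1 and hence has probability
1 (see Problem 1).

Now in (6.6.2), $\bigcap_{k=n}^{\infty} A_k$, $n = 1, 2, \ldots$ is an expanding sequence; hence
$P(\liminf_n A_n) = \lim_{n \to \infty} P(\bigcap_{k=n}^{\infty} A_k)$. Thus

$$
\begin{aligned}
R_n \xrightarrow{\text{a.s.}} R
&\iff P(\liminf_n A_{nm}) = 1 \quad \text{for all } m = 1, 2, \ldots \\
&\iff \lim_{n \to \infty} P\Bigl(\bigcap_{k=n}^{\infty} A_{km}\Bigr) = 1 \quad \text{for all } m = 1, 2, \ldots \\
&\iff P\{|R_k - R| < 1/m \text{ for all } k \ge n\} \to 1 \text{ as } n \to \infty \quad \text{for all } m = 1, 2, \ldots \\
&\iff P\{|R_k - R| \ge 1/m \text{ for at least one } k \ge n\} \to 0 \text{ as } n \to \infty \quad \text{for all } m = 1, 2, \ldots \\
&\iff P\{|R_k - R| \ge \varepsilon \text{ for at least one } k \ge n\} \to 0 \text{ as } n \to \infty \quad \text{for all } \varepsilon > 0
\end{aligned}
$$

(see Problem 2).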


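COROLLARY. $R_n \xrightarrow{\text{a.s.}} R$ implies $R_n \xrightarrow{P} R$.


PROOF. $R_n \xrightarrow{P} R$ iff for every $\varepsilon > 0$, $P\{|R_n - R| \ge \varepsilon\} \to 0$ as $n \to \infty$
(see Section 5.4). Now $\{|R_k - R| \ge \varepsilon$ for at least one $k \ge n\} \supset \{|R_n - R| \ge
\varepsilon\}$, so that $P\{|R_k - R| \ge \varepsilon$ for at least one $k \ge n\} \ge P\{|R_n - R| \ge \varepsilon\}$. The
result now follows from Theorem 3.


For an example in which $R_n \xrightarrow{P} R$ but $R_n$ does not converge almost surely to $R$, see Problem 3.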


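Theorem 4 (Borel-Cantelli Lemma). If $A_1, A_2, \ldots$ are events in a given
probability space, and $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(\limsup_n A_n) = 0$; that is,
the probability that $A_n$ occurs for infinitely many $n$ is 0.


PROOF. By (6.6.1),

$$
\begin{aligned}
P(\limsup_n A_n) &\le P\Bigl(\bigcup_{k=n}^{\infty} A_k\Bigr) \quad \text{for every } n \\
&\le \sum_{k=n}^{\infty} P(A_k) \quad \text{by (1.3.10)} \\
&= \sum_{k=1}^{\infty} P(A_k) - \sum_{k=1}^{n-1} P(A_k) \to 0 \quad \text{as } n \to \infty
\end{aligned}
$$

since $\sum_{k=1}^{\infty} P(A_k) < \infty$.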
Theorem 5. If for every $\varepsilon > 0$, $\sum_{n=1}^{\infty} P\{|R_n - R| \ge \varepsilon\} < \infty$, then
$R_n \xrightarrow{\text{a.s.}} R$.


PROOF. By Theorem 4, $\limsup_n \{|R_n - R| \ge \varepsilon\}$ has probability 0 for
every $\varepsilon > 0$; that is, the probability that $|R_n - R| \ge \varepsilon$ for infinitely many
$n$ is 0. Take complements to show that the probability that $|R_n - R| \ge \varepsilon$
for only finitely many $n$ is 1; that is, with probability 1, $|R_n - R| < \varepsilon$
eventually. Since $\varepsilon$ is arbitrary, we may set $\varepsilon = 1/m$, $m = 1, 2, \ldots$. Thus
$P(\liminf_n A_{nm}) = 1$ for $m = 1, 2, \ldots$, and the result follows by Theorem 2.


Theorem 6. If $\sum_{n=1}^{\infty} E[|R_n - R|^k] < \infty$ for some $k > 0$, then $R_n \xrightarrow{\text{a.s.}} R$.


PROOF. $P\{|R_n - R| \ge \varepsilon\} \le E[|R_n - R|^k]/\varepsilon^k$ by Chebyshev's inequality,
and the result follows by Theorem 5.


Theorem 7 (Strong Law of Large Numbers). Let $R_1, R_2, \ldots$ be independent
random variables on a given probability space. Assume uniformly bounded
fourth central moments; that is, for some positive real number $M$,

$$
E[|R_i - E(R_i)|^4] \le M
$$

for all $i$. Let $S_n = R_1 + \cdots + R_n$. Then

$$
\frac{S_n - E(S_n)}{n} \xrightarrow{\text{a.s.}} 0
$$

In particular, if $E(R_i) = m$ for all $i$, then $E(S_n) = nm$; hence

$$
\frac{S_n}{n} \xrightarrow{\text{a.s.}} m
$$
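PROOF. Since $S_n - E(S_n) = \sum_{i=1}^{n} R_i'$, where

$$
R_i' = R_i - E(R_i), \quad E(R_i') = 0, \quad E(|R_i'|^4) = E[|R_i - E(R_i)|^4] \le M < \infty
$$

we may assume without loss of generality that all $E(R_i)$ are 0. We show that
$\sum_{n=1}^{\infty} E[(S_n/n)^4] < \infty$, and the result will follow by Theorem 6. Now

$$
\begin{aligned}
S_n^4 = (R_1 + \cdots + R_n)^4
&= \sum_{j=1}^{n} R_j^4 + \sum_{j<k} \frac{4!}{2!\,2!} R_j^2 R_k^2
 + \sum_{\substack{j \ne k,\ j \ne l \\ k < l}} \frac{4!}{2!\,1!\,1!} R_j^2 R_k R_l \\
&\quad + \sum_{j<k<l<m} 4!\, R_j R_k R_l R_m + \sum_{j \ne k} \frac{4!}{3!\,1!} R_j^3 R_k
\end{aligned}
$$

But

$$
E(R_j^2 R_k R_l) = E(R_j^2)E(R_k)E(R_l) \quad \text{by independence} \quad = 0
$$

Similarly,

$$
E(R_j R_k R_l R_m) = E(R_j^3 R_k) = 0
$$

Thus

$$
E(S_n^4) = \sum_{j=1}^{n} E(R_j^4) + \sum_{j<k} 6E(R_j^2)E(R_k^2)
$$

By the Schwarz inequality,

$$
E(R_j^2) = E(R_j^2 \cdot 1) \le [E(R_j^4)E(1^2)]^{1/2} \le M^{1/2}
$$

Hence

$$
E(S_n^4) \le nM + 6\,\frac{n(n-1)}{2}\,M = (3n^2 - 2n)M < 3n^2 M
$$

Consequently

$$
\sum_{n=1}^{\infty} E\Bigl[\Bigl(\frac{S_n}{n}\Bigr)^4\Bigr] \le \sum_{n=1}^{\infty} \frac{3M}{n^2} < \infty
$$

The theorem is proved.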
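The fourth-moment computation can be checked exactly in a simple special case. The sketch below is an illustration, not part of the proof: it assumes each $R_i$ is $+1$ or $-1$ with probability $1/2$, so that $E(R_i) = 0$, $E(R_i^2) = E(R_i^4) = 1$ and $M = 1$. It then enumerates all sign patterns to compute $E(S_n^4)$, which should equal $n + 6\binom{n}{2} = 3n^2 - 2n$.

```python
from fractions import Fraction
from itertools import product

def fourth_moment_of_sum(n):
    """Exact E[S_n^4] when R_1..R_n are independent and uniform on {-1, +1}."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return Fraction(total, 2**n)

def bound_from_proof(n, m=1):
    """The quantity (3n^2 - 2n) M appearing in the proof."""
    return (3 * n * n - 2 * n) * m
```

In this symmetric case the bound is attained with equality, so $E[(S_n/n)^4] = (3n^2 - 2n)/n^4$ is of order $3/n^2$, which is what makes the series converge.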


REMARKS. If all $R_i$ have the same distribution function, it turns out that the
hypothesis on the fourth central moments can be replaced by the
assumption that the $E(R_i)$ are finite [of course, in this case $E(R_i)$ is
the same for all $i$]. In general, the hypothesis on the fourth central
moments can be replaced by the assumption that for some $M$ and
$\delta > 0$, $E[|R_i - E(R_i)|^{1+\delta}] \le M$ for all $i$.


Now consider an infinite sequence of Bernoulli trials, with $R_i = 1$ if there
is a success on trial $i$, $R_i = 0$ if there is a failure on trial $i$. Then

$$
\frac{S_n}{n} = \frac{R_1 + \cdots + R_n}{n}
$$

is the relative frequency of successes in $n$ trials and $E(R_i)$ is the probability
$p$ of success on a given trial. The strong law of large numbers says that if we
regard the observation of all the $R_i$ as one performance of the experiment,
the relative frequency of successes will almost certainly converge to $p$. The
weak law of large numbers says only that if we consider a sufficiently large
but fixed $n$ (essentially regarding observation of $R_1, \ldots, R_n$ as one
performance of the experiment), the probability that the relative frequency will
be within a specified distance $\varepsilon$ of $p$ is $\ge 1 - \delta$, where $\delta > 0$ is preassigned.
The requirements on $n$ will depend on $\varepsilon$ and $\delta$, as well as $p$. Recall that, by
Chebyshev's inequality,

$$
P\Bigl\{\Bigl|\frac{S_n - E(S_n)}{n}\Bigr| \ge \varepsilon\Bigr\}
\le \frac{E[(S_n - E S_n)^2/n^2]}{\varepsilon^2} = \frac{\operatorname{Var} S_n}{n^2\varepsilon^2} = \frac{p(1-p)}{n\varepsilon^2}
$$

Thus

$$
P\Bigl\{\Bigl|\frac{S_n}{n} - p\Bigr| < \varepsilon\Bigr\} \ge 1 - \frac{p(1-p)}{n\varepsilon^2} \ge 1 - \delta \quad \text{for large enough } n
$$
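The Chebyshev bound gives an explicit, if conservative, sample size: it suffices that $p(1-p)/(n\varepsilon^2) \le \delta$. A minimal sketch of that calculation follows; the function name is hypothetical, and exact rational arithmetic is used only to avoid rounding at the boundary.

```python
import math
from fractions import Fraction

def trials_needed(p, eps, delta):
    """Smallest n with p(1-p) / (n eps^2) <= delta, per the Chebyshev bound."""
    p, eps, delta = Fraction(p), Fraction(eps), Fraction(delta)
    return math.ceil(p * (1 - p) / (eps**2 * delta))
```

For instance, with $p = 1/2$ and $\varepsilon = \delta = 1/20$ the bound asks for 2000 trials. Since $p(1-p) \le 1/4$, the value at $p = 1/2$ works for every $p$.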


PROBLEMS


1. Show that a countable intersection of sets with probability 1 still has probability
1. Does this hold for an uncountable intersection?


2. If $P\{|R_k - R| \ge \varepsilon$ for at least one $k \ge n\} \to 0$ as $n \to \infty$ for all $\varepsilon$ of the form
$1/m$, $m = 1, 2, \ldots$, show that this holds for all $\varepsilon > 0$.


3. Let $R_0$ be uniformly distributed on the interval $(0, 1]$. Define the following
sequence of random variables.

$R_{11} = g_{11}(R_0) = 1$
$R_{21} = g_{21}(R_0) = 1$ if $0 < R_0 \le 1/2$; $R_{21} = 0$ otherwise
$R_{22} = 1$ if $1/2 < R_0 \le 1$, 0 otherwise
$R_{31} = 1$ if $0 < R_0 \le 1/3$, 0 otherwise
$R_{32} = 1$ if $1/3 < R_0 \le 2/3$, 0 otherwise
$R_{33} = 1$ if $2/3 < R_0 \le 1$, 0 otherwise


In general, let

$R_{nm} = 1$ if $(m-1)/n < R_0 \le m/n$, $m = 1, \ldots, n$, $n = 1, 2, \ldots$; 0 otherwise


(see Figure P.6.6.3 for $n = 4$).
The fact that we are using two subscripts is unimportant. We may arrange
the $R_{nm}$ as a single sequence.


$R_{11}, R_{21}, R_{22}, R_{31}, R_{32}, R_{33}, R_{41}, R_{42}, R_{43}, R_{44}$, etc.


Show that the sequence converges in probability to 0, but does not converge
almost surely to 0. In fact, $P\{R_{nm} \to 0\} = 0$.
FIGURE P.6.6.3


4. Let $A_1, A_2, \ldots$ be an arbitrary sequence of events in a given probability space.

(a) Show that $\liminf_n A_n \subset \limsup_n A_n$.

(b) If the $A_n$ form an expanding sequence whose union is $A$, or a contracting
sequence whose intersection is $A$, show that $\liminf_n A_n = \limsup_n A_n = A$.

(c) In general, if $\limsup_n A_n = \liminf_n A_n = A$, we say that $A$ is the limit
of the sequence $\{A_n\}$ (notation: $A = \lim_n A_n$). Give an example of a
sequence that is not eventually expanding or contracting (i.e., that does
not become an expanding or contracting sequence if an appropriate finite
number of terms is omitted), but that has a limit.

(d) If $A = \lim_n A_n$, show that $P(A) = \lim_{n \to \infty} P(A_n)$. HINT: $\bigcap_{k=n}^{\infty} A_k$
expands to $\liminf_n A_n$, $\bigcup_{k=n}^{\infty} A_k$ contracts to $\limsup_n A_n$, and $\bigcap_{k=n}^{\infty} A_k \subset
A_n \subset \bigcup_{k=n}^{\infty} A_k$.

(e) Show that $(\liminf_n A_n)^c = \limsup_n A_n^c$, and $(\limsup_n A_n)^c = \liminf_n A_n^c$.

5. Find $\limsup_n A_n$ and $\liminf_n A_n$ if $\Omega = E^1$ and


1 
A, = fo, 1 —- | if 2 is even 


1 
= [1.5] if n is odd 
n 


6. Let $\Omega = E^2$ and take $A_n$ = the interior of the circle with center at $((-1)^n/n, 0)$
and radius 1. Find $\limsup_n A_n$ and $\liminf_n A_n$.


7. Let $x_1, x_2, \ldots$ be a sequence of real numbers, and let $A_n = (-\infty, x_n)$. What is
the connection between $\limsup_n x_n$ and $\limsup_n A_n$ (similarly for lim inf)?

8. (Second Borel-Cantelli lemma) If $A_1, A_2, \ldots$ are independent events
and $\sum_{n=1}^{\infty} P(A_n) = \infty$, show that $P(\limsup_n A_n) = 1$. HINT: Show that
$P(\limsup_n A_n) = \lim_{n \to \infty} \lim_{m \to \infty} P(\bigcup_{k=n}^{m} A_k)$, and consider $(\bigcup_{k=n}^{m} A_k)^c$;
use the fact that $e^{-x} \ge 1 - x$.


9. Let $R_1, R_2, \ldots$ be a sequence of independent random variables, all defined on
the same probability space. Let $c$ be any real number. Show that $R_n \xrightarrow{\text{a.s.}} c$ if
and only if for every $\varepsilon > 0$, $\sum_{n=1}^{\infty} P\{|R_n - c| \ge \varepsilon\} < \infty$.


10. Let $R_1, R_2, \ldots$ be a sequence of independent random variables, and let $c$ be
any real number. Show that either $R_n \xrightarrow{\text{a.s.}} c$ or $R_n$ "diverges almost surely"
from $c$; that is, $P\{\omega: \lim_{n \to \infty} R_n(\omega) = c\} = 0$. Thus, for example, it is impossible
that $P\{R_n \to c\} = 1/3$.

11. Let $R_1, R_2, \ldots$ be independent random variables, with $R_n = 1$ with probability
$p_n$, $R_n = 0$ with probability $1 - p_n$.

(a) What conditions on the $p_n$ are equivalent to the statement that $R_n \xrightarrow{P} 0$?

(b) What conditions on the $p_n$ are equivalent to the statement that $R_n \xrightarrow{\text{a.s.}} 0$?


12. Let $R_1, R_2, \ldots$ be independent random variables, with $E(R_n) = 0$, $\operatorname{Var} R_n =
\sigma_n^2 \le M/n$, where $M$ is some fixed positive constant. Show that $(R_1 + \cdots +
R_n)/n \xrightarrow{\text{a.s.}} 0$.


13. Give an example of a particular sequence of random variables $R_1, R_2, \ldots$
and a random variable $R$ such that $0 < P\{R_n \to R\} < 1$.


Markov Chains 


7.1 INTRODUCTION


Suppose that a machine with two components is inspected every hour. A
given component operating at $t = n$ hours has probability $p$ of failing before
the next inspection at $t = n + 1$; a component that is not operating at
$t = n$ has probability $r$ of being repaired at $t = n + 1$, regardless of how
long it has been out of service. (Each repair crew works a 1-hour day and
refuses to inform the next crew of any insights it may have gained.) The
components are assumed to fail and be repaired independently of each other.
The situation may be summarized as follows. If $R_n$ is the number of
components in operation at $t = n$, then

$$
\begin{aligned}
P\{R_{n+1} = 0 \mid R_n = 0\} &= (1 - r)^2 \\
P\{R_{n+1} = 1 \mid R_n = 0\} &= 2r(1 - r) \\
P\{R_{n+1} = 2 \mid R_n = 0\} &= r^2 \\
P\{R_{n+1} = 0 \mid R_n = 1\} &= p(1 - r) \\
P\{R_{n+1} = 1 \mid R_n = 1\} &= pr + (1 - p)(1 - r) \\
P\{R_{n+1} = 2 \mid R_n = 1\} &= (1 - p)r \\
P\{R_{n+1} = 0 \mid R_n = 2\} &= p^2 \\
P\{R_{n+1} = 1 \mid R_n = 2\} &= 2p(1 - p) \\
P\{R_{n+1} = 2 \mid R_n = 2\} &= (1 - p)^2
\end{aligned}
\qquad (7.1.1)
$$


FIGURE 7.1.1 A Markov Chain.


For example, if component A is operating and component B is out of
service at $t = n$, then in order to have one component in service at $t =
n + 1$, either A fails and B is repaired between $t = n$ and $t = n + 1$, or A
does not fail and B is not repaired between $t = n$ and $t = n + 1$. In order to
have two components in service at $t = n + 1$, A must not fail and B must be
repaired. The other entries of (7.1.1) are derived similarly.

Thus, at time $t = n$, there are three possible states, 0, 1, and 2; to be in
state $i$ means that $i$ components are in operation; that is, $R_n = i$. There are
various transition probabilities $p_{ij}$ indicating the probability of moving to
state $j$ at $t = n + 1$, given that we are in state $i$ at $t = n$; thus

$$
p_{ij} = P\{R_{n+1} = j \mid R_n = i\}
$$

(see Figure 7.1.1).
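The $p_{ij}$ may be arranged in the form of a matrix:

$$
\Pi = [p_{ij}] =
\begin{array}{c|ccc}
 & 0 & 1 & 2 \\ \hline
0 & (1 - r)^2 & 2r(1 - r) & r^2 \\
1 & p(1 - r) & pr + (1 - p)(1 - r) & (1 - p)r \\
2 & p^2 & 2p(1 - p) & (1 - p)^2
\end{array}
$$

Notice that $\Pi$ is stochastic; that is, the elements are nonnegative and the
row sums $\sum_j p_{ij}$ are 1 for all $i$.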
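The matrix for the two-component machine is easy to build and check in code. The sketch below uses hypothetical function names and exact fractions; it constructs the $3 \times 3$ matrix from the failure probability $p$ and the repair probability $r$ as in (7.1.1) and tests the two defining properties of a stochastic matrix.

```python
from fractions import Fraction

def machine_matrix(p, r):
    """Transition matrix of the two-component machine; states 0, 1, 2 = components operating."""
    q, s = 1 - p, 1 - r  # q: a working component survives; s: a broken one stays broken
    return [
        [s * s, 2 * r * s, r * r],
        [p * s, p * r + q * s, q * r],
        [p * p, 2 * p * q, q * q],
    ]

def is_stochastic(matrix):
    """True if every entry is nonnegative and every row sums to 1."""
    return all(
        all(entry >= 0 for entry in row) and sum(row) == 1 for row in matrix
    )
```

With `Fraction` inputs the row sums are exactly 1; with floating-point inputs the comparison `sum(row) == 1` would need a tolerance.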

If $R_n$ is the state of the process at time $n$, then, according to the way the
problem is stated, if we know that $R_0 = i_0, R_1 = i_1, \ldots, R_n = i_n =$ (say) 2,
we are in state 2 at $t = n$. Regardless of how we got there, once we know that
$R_n = 2$, the probability that $R_{n+1} = j$ is $\binom{2}{j}(1 - p)^j p^{2-j}$, $j = 0, 1, 2$. In
other words,

$$
P\{R_{n+1} = i_{n+1} \mid R_0 = i_0, \ldots, R_n = i_n\} = P\{R_{n+1} = i_{n+1} \mid R_n = i_n\} \qquad (7.1.2)
$$

This is the Markov property.

What we intend to show is that, given a description of a process in terms of
states and transition probabilities (formally, given a stochastic matrix),
we can construct in a natural way an infinite sequence of random variables
satisfying (7.1.2). Assume that we are given a stochastic matrix $\Pi = [p_{ij}]$,
where $i$ and $j$ range over the state space $S$, which we take to be a finite or
infinite subset of the integers. Let $p_i$, $i \in S$, be a set of nonnegative numbers
adding to 1 (the initial distribution). We specify the joint probability function
of $R_0, R_1, \ldots, R_n$ as follows.

$$
P\{R_0 = i_0, R_1 = i_1, \ldots, R_n = i_n\} = p_{i_0} p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}, \quad n = 1, 2, \ldots; \qquad P\{R_0 = i_0\} = p_{i_0} \qquad (7.1.3)
$$

If we sum the right side of (7.1.3) over $i_{k+1}, \ldots, i_n$, we obtain $p_{i_0} p_{i_0 i_1} \cdots
p_{i_{k-1} i_k}$, since $\Pi$ is stochastic. But this coincides with the original specification
of $P\{R_0 = i_0, R_1 = i_1, \ldots, R_k = i_k\}$.

Thus the joint probability functions (7.1.3) are consistent, and therefore,
by the discussion in Section 6.1, we can construct a sequence of random
variables $R_0, R_1, \ldots$ such that for each $n$ the joint probability function of
$(R_0, \ldots, R_n)$ is given by (7.1.3).

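Let us verify that the Markov property (7.1.2) is in fact satisfied. We
have

$$
\begin{aligned}
P\{R_{n+1} = i_{n+1} \mid R_0 = i_0, \ldots, R_n = i_n\}
&= \frac{P\{R_0 = i_0, \ldots, R_{n+1} = i_{n+1}\}}{P\{R_0 = i_0, \ldots, R_n = i_n\}} \\
&\quad \text{(assuming the denominator is not zero)} \\
&= p_{i_n i_{n+1}} \quad \text{by (7.1.3)}
\end{aligned}
$$

But

$$
\begin{aligned}
P\{R_{n+1} = i_{n+1} \mid R_n = i_n\}
&= \frac{P\{R_n = i_n, R_{n+1} = i_{n+1}\}}{P\{R_n = i_n\}} \\
&= \frac{\sum_{i_0, \ldots, i_{n-1}} P\{R_0 = i_0, \ldots, R_n = i_n, R_{n+1} = i_{n+1}\}}{\sum_{i_0, \ldots, i_{n-1}} P\{R_0 = i_0, \ldots, R_n = i_n\}} \\
&= \frac{\sum_{i_0, \ldots, i_{n-1}} p_{i_0} p_{i_0 i_1} \cdots p_{i_{n-1} i_n} p_{i_n i_{n+1}}}{\sum_{i_0, \ldots, i_{n-1}} p_{i_0} p_{i_0 i_1} \cdots p_{i_{n-1} i_n}} \\
&= p_{i_n i_{n+1}}
\end{aligned}
$$

establishing the Markov property.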
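The specification (7.1.3) and the two facts just derived from it can be tried out on a small example. In the sketch below the two-state chain (`INITIAL`, `MATRIX`) and the function names are hypothetical choices made only for illustration.

```python
from fractions import Fraction

def path_probability(initial, matrix, path):
    """P{R_0 = i_0, ..., R_n = i_n} as specified by (7.1.3)."""
    prob = initial[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= matrix[i][j]
    return prob

# Hypothetical two-state chain, chosen only for illustration.
INITIAL = [Fraction(1, 3), Fraction(2, 3)]
MATRIX = [[Fraction(1, 2), Fraction(1, 2)],
          [Fraction(3, 4), Fraction(1, 4)]]

def conditional_next(path, next_state):
    """P{R_{n+1} = next_state | R_0 = i_0, ..., R_n = i_n}; assumes a positive denominator."""
    joint = path_probability(INITIAL, MATRIX, path + [next_state])
    return joint / path_probability(INITIAL, MATRIX, path)
```

Summing `path_probability` over the last coordinate recovers the probability of the shorter path (consistency), and `conditional_next` depends on the path only through its final state (the Markov property).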
The sequence $\{R_n\}$ is called the Markov chain corresponding to the matrix
$\Pi$ and the initial distribution $\{p_i\}$. $\Pi$ is called the transition matrix, and the
$p_{ij}$ the transition probabilities, of the chain.


REMARK. The basic construction given here can be carried out if we have
"nonstationary transition probabilities," that is, if, instead of the
"stationary transition probabilities" $p_{ij}$, we have, for each $n = 0,
1, \ldots$, a stochastic matrix $[{}_n p_{ij}]$. ${}_n p_{ij}$ is interpreted as the probability
of moving to state $j$ at time $n + 1$ when the state at time $n$ is $i$. We
define $P\{R_0 = i_0, \ldots, R_n = i_n\} = p_{i_0}\,{}_0 p_{i_0 i_1}\,{}_1 p_{i_1 i_2} \cdots {}_{n-1} p_{i_{n-1} i_n}$. This
yields a Markov process, a sequence of random variables satisfying
the Markov property, with $P\{R_n = i_n \mid R_0 = i_0, \ldots, R_{n-1} = i_{n-1}\} =
{}_{n-1} p_{i_{n-1} i_n}$.


We begin the analysis of Markov chains by calculating the probability
function of $R_n$ in terms of the initial probabilities $p_i$ and the matrix $\Pi$. Let
$p_j^{(n)} = P\{R_n = j\}$, $n = 0, 1, \ldots$, $j \in S$ (thus $p_j^{(0)} = p_j$). If we are in state
$i$ at time $n - 1$, we move to state $j$ at time $n$ with probability $P\{R_n =
j \mid R_{n-1} = i\} = p_{ij}$; thus, by the theorem of total probability,

$$
p_j^{(n)} = \sum_i P\{R_{n-1} = i\}\,P\{R_n = j \mid R_{n-1} = i\} = \sum_i p_i^{(n-1)} p_{ij} \qquad (7.1.4)
$$

If $v^{(n)} = (p_i^{(n)},\ i \in S)$ is the "state distribution" or "state probability vector"
at time $n$, then (7.1.4) may be written in matrix form as

$$
v^{(n)} = v^{(n-1)}\Pi
$$

Iterating this, we obtain

$$
v^{(n)} = v^{(0)}\Pi^n \qquad (7.1.5)
$$

But suppose that we specify $R_0 = i$; that is, $p_i^{(0)} = 1$, $p_j^{(0)} = 0$, $j \ne i$.
Then $v^{(0)}$ has a 1 in the $i$th coordinate and 0's elsewhere, so that by
(7.1.5) $v^{(n)}$ is simply row $i$ of $\Pi^n$. Thus the element $p_{ij}^{(n)}$ in row $i$ and column
$j$ of $\Pi^n$ is the probability that $R_n = j$ when the initial state is $i$. In other
words,

$$
p_{ij}^{(n)} = P\{R_n = j \mid R_0 = i\} \qquad (7.1.6)
$$

(A slight formal quibble lies behind the phrase "in other words"; see Problem
1.) Because of (7.1.6), $\Pi^n$ is called the n-step transition matrix; it follows
immediately that $\Pi^n$ is stochastic.

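We shall be interested in the behavior of $\Pi^n$ for large $n$. As an example,
suppose that

$$
\Pi = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\[2pt] \tfrac{3}{4} & \tfrac{1}{4} \end{bmatrix}
$$

We compute

$$
\Pi^2 = \begin{bmatrix} \tfrac{5}{8} & \tfrac{3}{8} \\[2pt] \tfrac{9}{16} & \tfrac{7}{16} \end{bmatrix},
\qquad
\Pi^4 = \begin{bmatrix} \tfrac{154}{256} & \tfrac{102}{256} \\[2pt] \tfrac{153}{256} & \tfrac{103}{256} \end{bmatrix}
$$

Thus $p_{11}^{(4)} \approx p_{21}^{(4)}$ and $p_{12}^{(4)} \approx p_{22}^{(4)}$, so that the probability of being in state $j$ at
time $t = 4$ is almost independent of the initial state. It appears as if, for large
$n$, a "steady state" condition will be approached; the probability of being in a
particular state $j$ at $t = n$ will be almost independent of the initial state at
$t = 0$. Mathematically, we express this condition by saying that

$$
\lim_{n \to \infty} p_{ij}^{(n)} = v_j, \quad i, j \in S \qquad (7.1.7)
$$

where $v_j$ does not depend on $i$.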

In Sections 7.4 and 7.5 we investigate the conditions under which (7.1.7) 
holds; it is not true for an arbitrary Markov chain. 

Note that (7.1.7) is equivalent to the statement that $\Pi^n \to$ a matrix with
identical rows, the rows being $(v_j,\ j \in S)$.
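The approach to a steady state can be watched numerically by computing powers of a transition matrix. The sketch below is illustrative only: the helper names are hypothetical, and the two-state matrix `PI` is an assumed example, chosen because its second and fourth powers have the entries 5/8, 3/8, 9/16, 7/16 and 154/256, 102/256, 153/256, 103/256 displayed above.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def mat_pow(matrix, n):
    """n-step transition matrix, computed by repeated multiplication."""
    size = len(matrix)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result

# Assumed example matrix (see the lead-in above).
PI = [[Fraction(1, 2), Fraction(1, 2)],
      [Fraction(3, 4), Fraction(1, 4)]]
```

For this matrix the rows of `mat_pow(PI, n)` both approach $(3/5, 2/5)$, which is the vector $v$ satisfying $v = v\,\Pi$.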


▶ Example 1. Consider the simple random walk with no barriers. Then
$S$ = the integers and $p_{i,i+1} = p$, $p_{i,i-1} = q = 1 - p$, $i = 0, \pm 1, \pm 2, \ldots$.
If there is an absorbing barrier at 0 (gambler's ruin problem when the
adversary has infinite capital), $S$ = the nonnegative integers and $p_{i,i+1} = p$,
$p_{i,i-1} = q$, $i = 1, 2, \ldots$, $p_{00} = 1$ (hence $p_{0j} = 0$, $j \ne 0$).
If there are absorbing barriers at 0 and $b$, then $S = \{0, 1, \ldots, b\}$,
$p_{i,i+1} = p$, $p_{i,i-1} = q$, $i = 1, 2, \ldots, b - 1$, $p_{00} = p_{bb} = 1$. ◀


▶ Example 2. Consider an infinite sequence of Bernoulli trials. Let state 1
(at $t = n$) correspond to successes (S) at $t = n - 1$ and at $t = n$; state 2
to success at $t = n - 1$ and failure (F) at $t = n$; state 3 to failure at $t = n - 1$
and success at $t = n$; state 4 to failures at $t = n - 1$ and at $t = n$ (see
Figure 7.1.2). We observe that $\Pi^2$ has identical rows, the rows being
$(p^2, pq, qp, q^2)$. Hence $\Pi^n$ has identical rows for $n \ge 2$ (see Problem 2), so
that $\Pi^n$ approaches a limit. ◀


FIGURE 7.1.2
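The claim about $\Pi^2$ in Example 2 can be verified by writing out the four-state matrix. In the sketch below the function names are hypothetical; states 1 to 4 (SS, SF, FS, FF) are stored at indices 0 to 3, and the transitions follow from the fact that the new state keeps the latest outcome and appends the next one.

```python
from fractions import Fraction

def bernoulli_pair_matrix(p):
    """Transition matrix for the states SS, SF, FS, FF (indices 0..3)."""
    q = 1 - p
    after_success = [p, q, 0, 0]  # latest outcome S: next state is SS or SF
    after_failure = [0, 0, p, q]  # latest outcome F: next state is FS or FF
    return [after_success, after_failure, after_success, after_failure]

def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def two_step_matrix(p):
    """The matrix of two-step transition probabilities."""
    m = bernoulli_pair_matrix(p)
    return mat_mul(m, m)
```

Every row of `two_step_matrix(p)` equals $(p^2, pq, qp, q^2)$: after two further trials the state is determined entirely by those two trials, so the starting state no longer matters.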


▶ Example 3. (A queueing example) Assume that customers are to be
served at discrete times $t = 0, 1, \ldots$, and at most one customer can be
served at a given time. Say there are $R_n$ customers before the completion of
service at time $n$, and, in the interval $[n, n + 1)$, $N_n$ new customers arrive,
where $P\{N_n = k\} = p_k$, $k = 0, 1, \ldots$. The number of customers before
completion of service at time $n + 1$ is

$$
R_{n+1} = (R_n - 1)^+ + N_n
$$

That is,

$$
R_{n+1} = \begin{cases} R_n - 1 + N_n & \text{if } R_n \ge 1 \\ N_n & \text{if } R_n = 0 \end{cases}
$$

If the number of customers at time $n + 1$ is $\ge M$, a new serving counter
automatically opens and immediately serves all customers who are waiting
and also those who arrive in the interval $[n + 1, n + 2)$; thus $R_{n+2} = 0$.

The queueing process may be represented as a Markov chain with $S$ = the
nonnegative integers and transition matrix

$$
\Pi =
\begin{array}{c|cccccc}
 & 0 & 1 & 2 & 3 & \cdots & \\ \hline
0 & p_0 & p_1 & p_2 & \cdots & & \\
1 & p_0 & p_1 & p_2 & \cdots & & \\
2 & 0 & p_0 & p_1 & p_2 & \cdots & \\
3 & 0 & 0 & p_0 & p_1 & p_2 & \cdots \\
\vdots & & & & & & \\
M - 1 & 0 & \cdots & 0 & p_0 & p_1 & p_2 \ \cdots \\
M & 1 & 0 & \cdots & & & \\
M + 1 & 1 & 0 & \cdots & & & \\
\vdots & & & & & &
\end{array}
$$
◀
PROBLEMS


1. Consider a Markov chain $\{R_n\}$ with transition matrix $\Pi = [p_{ij}]$ and initial
distribution $\{p_i\}$; assume $p_r > 0$. Let $\{T_n\}$ be a Markov chain with the same
transition matrix, and initial distribution $\{q_i\}$, where $q_r = 1$, $q_j = 0$, $j \ne r$.
Show that $P\{R_n = j \mid R_0 = r\} = P\{T_n = j\} = p_{rj}^{(n)}$; this justifies (7.1.6).

2. If $\Pi$ is a transition matrix of a Markov chain, and $\Pi^k$ has identical rows, show
that $\Pi^n$ has identical rows for all $n \ge k$. Similarly, if $\Pi^k$ has a column all of
whose elements are $\ge \delta > 0$, show that $\Pi^n$ has this property for all $n \ge k$.

3. Let $\{R_n\}$ be a Markov chain with state space $S$, and let $g$ be a function from $S$
to $S$. If $g$ is one-to-one, $\{g(R_n)\}$ is also a Markov chain (this simply amounts to
relabeling the states). Give an example to show that if $g$ is not one-to-one,
$\{g(R_n)\}$ need not have the Markov property.


4. If $\{R_n\}$ is a Markov chain, show that
$P\{R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid R_n = i\} = p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k}$


7.2 STOPPING TIMES AND THE STRONG 
MARKOV PROPERTY 


Let $\{R_n\}$ be a Markov chain, and let $T$ be the first time at which a particular
state $i$ is reached; set $T = \infty$ if $i$ is never reached. For example, if $i = 3$ and
$R_0(\omega) = 4$, $R_1(\omega) = 2$, $R_2(\omega) = 2$, $R_3(\omega) = 5$, $R_4(\omega) = 3$, $R_5(\omega) = 1$,
$R_6(\omega) = 3, \ldots$, then $T(\omega) = 4$. For our present purposes, the key
feature of $T$ is that if we examine $R_0, R_1, \ldots, R_k$, we can come to a
definite decision as to whether or not $T = k$. Formally, for each $k = 0,
1, 2, \ldots$, $I_{\{T = k\}}$ can be expressed as $g_k(R_0, R_1, \ldots, R_k)$, where $g_k$ is a
function from $S^{k+1}$ to $\{0, 1\}$. A random variable $T$, whose possible values are the
nonnegative integers together with $\infty$, that satisfies this condition for each
$k = 0, 1, \ldots$ is said to be a stopping time for the chain $\{R_n\}$.
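The defining feature of a stopping time, that the event $\{T = k\}$ is decided by the first $k + 1$ observations, can be made concrete for a first hitting time. The sketch below uses hypothetical function names, and the sample path in the usage note is an arbitrary illustration.

```python
import math

def first_hitting_time(path, state):
    """First index k with path[k] == state, or infinity if the state never appears."""
    for k, value in enumerate(path):
        if value == state:
            return k
    return math.inf

def hitting_indicator(prefix, state):
    """g_k(R_0, ..., R_k): 1 if T = k, using only the observations R_0..R_k in prefix."""
    *earlier, last = prefix
    return int(last == state and state not in earlier)
```

For the illustrative path `[4, 2, 2, 5, 3, 1, 3]` and state 3, `first_hitting_time` returns 4, and `hitting_indicator` applied to the first five values returns 1 without looking at any later observation.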

Now let $T$ be the first time at which the state $i$ is reached, as above. If we
look at the sequence $\{R_n\}$ after we arrive at $i$, in other words, the sequence
$R_T, R_{T+1}, \ldots$, it is reasonable to expect that we have a Markov chain with
the same transition probabilities as the original chain. After all, if $T = k$,
we are looking at the sequence $R_k, R_{k+1}, \ldots$. However, since $T$ is a random
variable rather than a constant, there is something to be proved. We first
introduce a new concept.

If $T$ is a stopping time, an event $A$ is said to be prior to $T$ iff, whenever
$T = k$, we can tell by examination of $R_0, \ldots, R_k$ whether or not $A$ has
occurred. Formally, for each $k = 0, 1, \ldots$, $I_{A \cap \{T = k\}}$ can be expressed as
$h_k(R_0, R_1, \ldots, R_k)$, where $h_k$ is a function from $S^{k+1}$ to $\{0, 1\}$.


▶ Example 1. If $T$ is a stopping time for the Markov chain $\{R_n\}$, define
the random variable $R_T$ as follows.

If $T(\omega) = k$, take $R_T(\omega) = R_k(\omega)$, $k = 0, 1, \ldots$.

If $T(\omega) = \infty$, take $R_T(\omega) = c$, where $c$ is an arbitrary element not belonging
to the state space $S$. If we like we can replace $S$ by $S \cup \{c\}$ and define
$p_{cc} = 1$, $p_{cj} = 0$, $j \in S$.

If $A = \{R_{T-j} = i\}$, $i \in S$, $j = 0, 1, \ldots$, then $A$ is prior to $T$. (Set $R_{T-j} =
R_0$ if $j > T$.)
To see this, note that $\{R_{T-j} = i\} \cap \{T = k\} = \{R_{k-j} = i\} \cap \{T = k\}$;
examination of $R_0, \ldots, R_k$ determines the value of $R_{k-j}$, and also determines
whether or not $T = k$. ◀

▶ Example 2. If $T$ is a stopping time for the Markov chain $\{R_n\}$, then
$\{T = r\}$ is prior to $T$ for all $r = 0, 1, \ldots$. For

$$
\{T = r\} \cap \{T = k\} = \begin{cases} \emptyset & \text{if } r \ne k \\ \{T = k\} & \text{if } r = k \end{cases}
$$

In either case $I_{\{T = r\} \cap \{T = k\}}$ is a function of $R_0, \ldots, R_k$, since $T$ is a stopping
time. ◀

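Theorem 1. Let $T$ be a stopping time for the Markov chain $\{R_n\}$. If $A$ is
prior to $T$, then

$$
P(A \cap \{R_T = i, R_{T+1} = i_1, \ldots, R_{T+k} = i_k\})
= P(A \cap \{R_T = i\})\,p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k} \quad (i, i_1, \ldots, i_k \in S)
$$

PROOF. The probability of the set on the left is

$$
\begin{aligned}
&\sum_{n=0}^{\infty} P(A \cap \{T = n, R_n = i, R_{n+1} = i_1, \ldots, R_{n+k} = i_k\}) \\
&\quad = \sum_{n=0}^{\infty} P(A \cap \{T = n, R_n = i\})\,P(R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid A \cap \{T = n, R_n = i\})
\end{aligned}
$$

(Actually we sum only over those $n$ for which $P(A \cap \{T = n, R_n = i\}) > 0$.)
Now

$$
\begin{aligned}
P(R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid A \cap \{T = n, R_n = i\})
&= P\{R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid R_n = i\} \\
&\quad \text{since } I_{A \cap \{T = n\}} \text{ is a function of } R_0, R_1, \ldots, R_n \text{ (see Problem 1)} \\
&= p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k}
\end{aligned}
$$

(Problem 4, Section 7.1). Thus the summation becomes

$$
\sum_{n=0}^{\infty} P(A \cap \{T = n, R_n = i\})\,p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k}
= P(A \cap \{R_T = i\})\,p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k}
$$

and the result follows.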
Theorem 2 (Strong Markov Property). Let $T$ be a stopping time for the
Markov chain $\{R_n\}$. Then

(a) $P\{R_{T+1} = i_1, \ldots, R_{T+k} = i_k \mid R_T = i\} = p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k}$ if
$P\{R_T = i\} > 0$ $(i, i_1, \ldots, i_k \in S)$
(b) If $A$ is prior to $T$, then

$$
P(R_{T+1} = i_1, \ldots, R_{T+k} = i_k \mid A \cap \{R_T = i\})
= P\{R_{T+1} = i_1, \ldots, R_{T+k} = i_k \mid R_T = i\}
$$

if $P(A \cap \{R_T = i\}) > 0$ $(i, i_1, \ldots, i_k \in S)$


PROOF. (a) follows from Theorem 1 by taking $A = \Omega$. (b) follows upon
dividing the equality of Theorem 1 by $P(A \cap \{R_T = i\})$ and using (a).

Thus the sequence $R_T, R_{T+1}, \ldots$ has essentially the same properties as the
original sequence $R_0, R_1, \ldots$.


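REMARK. The strong Markov property reduces to the ordinary Markov
property (7.1.2) if we set $k = 1$, $T \equiv n$, and $A = \{R_0 = j_0, \ldots,
R_{n-1} = j_{n-1}\}$. For $T$ is a stopping time since

$$
I_{\{T = k\}} = g_k(R_0, \ldots, R_k) = \begin{cases} 0 & \text{if } k \ne n \\ 1 & \text{if } k = n \end{cases}
$$

and $A$ is prior to $T$ since

$$
I_{A \cap \{T = k\}} = h_k(R_0, \ldots, R_k) = \begin{cases} 1 & \text{if } k = n \text{ and } R_0 = j_0, \ldots, R_{n-1} = j_{n-1} \\ 0 & \text{otherwise} \end{cases}
$$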

PROBLEMS 


1. Let $\{R_n\}$ be a Markov chain. If $D$ is an event whose occurrence or nonoccurrence
is determined by examination of $R_0, \ldots, R_n$, that is, $I_D$ is a function of $R_0, \ldots,
R_n$, or, equivalently, $D$ is of the form $\{(R_0, \ldots, R_n) \in B\}$ for some $B \subset S^{n+1}$,
show that

$$
P(R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid D \cap \{R_n = i\})
= P\{R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid R_n = i\} \quad \text{if } P(D \cap \{R_n = i\}) > 0
$$

2. If $\{R_n\}$ is a Markov chain, show that the "reversed sequence" $\cdots R_n, R_{n-1},
R_{n-2}, \ldots$ also has the Markov property.
7.3 CLASSIFICATION OF STATES


In this section we examine various modes of behavior of Markov chains. 
A key to the analysis is the following result. We consider a fixed Markov 
chain {R,,} throughout. 


Theorem 1 (First Entrance Theorem). Let $f_{ii}^{(n)}$ be the probability that the
first return to $i$ will occur at time $n$, when the initial state is $i$, that is,

$$
f_{ii}^{(n)} = P\{R_n = i,\ R_k \ne i \text{ for } 1 \le k \le n - 1 \mid R_0 = i\}, \quad n = 1, 2, \ldots
$$

If $i \ne j$, let $f_{ij}^{(n)}$ be the probability that the first visit to state $j$ will occur at time
$n$, when the initial state is $i$; that is,

$$
f_{ij}^{(n)} = P\{R_n = j,\ R_k \ne j \text{ for } 1 \le k \le n - 1 \mid R_0 = i\}, \quad n = 1, 2, \ldots
$$

Then

$$
p_{ij}^{(n)} = \sum_{k=1}^{n} f_{ij}^{(k)} p_{jj}^{(n-k)}, \quad n = 1, 2, \ldots
$$
k= 


PROOF. Intuitively, if we are to be in state $j$ after $n$ steps, we must reach $j$
for the first time at step $k$, $1 \le k \le n$. After this happens, we are in state $j$
and must be in state $j$ again after the $n - k$ remaining steps. For a formal
proof, we use the strong Markov property. Assume that the initial state is $i$,
and let $T$ be the time of the first visit to $j$ ($T = \min\{k \ge 1: R_k = j\}$ if $R_k = j$
for some $k = 1, 2, \ldots$; $T = \infty$ if $R_k \ne j$ for all $k = 1, 2, \ldots$). Then

$$
P\{R_n = j\} = \sum_{k=1}^{n} P\{T = k, R_n = j\} = \sum_{k=1}^{n} P\{T = k, R_{T+n-k} = j\}
$$

But

$$
P\{T = k, R_{T+n-k} = j\} = P\{T = k\}\,P\{R_{T+n-k} = j \mid T = k\}
$$

and since $\{T = k\} = \{T = k, R_T = j\}$,

$$
\begin{aligned}
P\{R_{T+n-k} = j \mid T = k\} &= P\{R_{T+n-k} = j \mid R_T = j, T = k\} \\
&= P\{R_{T+n-k} = j \mid R_T = j\} \quad \text{by Theorem 2b of Section 7.2} \\
&= p_{jj}^{(n-k)} \quad \text{by Theorem 2a of Section 7.2}
\end{aligned}
$$

Since $P\{T = k\} = f_{ij}^{(k)}$, the result follows.

Now let

$$
f_{ii} = \sum_{n=1}^{\infty} f_{ii}^{(n)}
$$

$f_{ii}$ is the probability of eventual return to state $i$ when the initial state is $i$.
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Theorem 2. [If the initial state is i, the probability of returning to i at least 
r times is (f;,;)". 


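PROOF. The result is immediate if $r = 1$. If it holds when $r = m - 1$, let
$T$ be the time of first return to $i$. Then, starting from $i$, $P\{\text{return to } i \text{ at least } m \text{ times}\} = \sum_{k=1}^{\infty} P\{T = k, \text{ at least } m - 1 \text{ returns after } T\}$. But

$$P\{T = k, \text{ at least } m - 1 \text{ returns after } T\}$$

$$= P\{T = k\}\, P\{R_{T+1}, R_{T+2}, \ldots \text{ returns to } i \text{ at least } m - 1 \text{ times} \mid T = k\}$$

$$= P\{T = k\}\, P\{R_{T+1}, R_{T+2}, \ldots \text{ returns to } i \text{ at least } m - 1 \text{ times} \mid R_T = i,\ T = k\}$$

By the strong Markov property this may be written as

$$P\{T = k\}\, P\{R_{T+1}, R_{T+2}, \ldots \text{ returns to } i \text{ at least } m - 1 \text{ times} \mid R_T = i\}$$

$$= P\{T = k\}\, P\{R_1, R_2, \ldots \text{ returns to } i \text{ at least } m - 1 \text{ times} \mid R_0 = i\}$$

$$= f_{ii}^{(k)} (f_{ii})^{m-1} \qquad \text{by the induction hypothesis}$$

Thus the probability of returning to $i$ at least $m$ times is

$$\sum_{k=1}^{\infty} f_{ii}^{(k)} (f_{ii})^{m-1} = f_{ii} (f_{ii})^{m-1} = (f_{ii})^m$$

which is the desired result.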


COROLLARY. Let the initial state be $i$. If $f_{ii} = 1$, the probability of
returning to $i$ infinitely often is 1. If $f_{ii} < 1$, the probability of returning to $i$
infinitely often is 0.

PROOF. The events $\{\text{return to } i \text{ at least } r \text{ times}\}$, $r = 1, 2, \ldots$ form a
contracting sequence whose intersection is $\{\text{return to } i \text{ infinitely often}\}$. Thus
the probability of returning to $i$ infinitely often is $\lim_{r \to \infty} (f_{ii})^r$, and the
result follows.
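As a small worked consequence of Theorem 2 (a sketch only; the symbol $V$ for the total number of returns to $i$, starting from $i$, is introduced here for illustration and is not notation from the text), the number of returns has a geometric distribution when $f_{ii} < 1$:

```latex
\begin{align*}
P\{V \ge r\} &= (f_{ii})^r, \qquad r = 1, 2, \ldots \\
P\{V = r\} &= (f_{ii})^r - (f_{ii})^{r+1} = (f_{ii})^r (1 - f_{ii}), \qquad r = 0, 1, \ldots \\
E(V) &= \sum_{r=1}^{\infty} P\{V \ge r\} = \frac{f_{ii}}{1 - f_{ii}} \qquad \text{if } f_{ii} < 1
\end{align*}
```

For instance, $f_{ii} = 0.8$ gives $E(V) = 4$ returns on average, whereas $f_{ii} = 1$ makes every $P\{V \ge r\}$ equal to 1, in agreement with the corollary.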


DEFINITION. If $f_{ii} = 1$, we say that the state $i$ is recurrent or persistent; if
$f_{ii} < 1$, we say that $i$ is transient.

It is useful to have a criterion for recurrence in terms of the probabilities
$p_{ii}^{(n)}$, since these numbers are often easier to handle than the $f_{ii}^{(n)}$.

Theorem 3. The state $i$ is recurrent iff $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$.


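PROOF. By the first entrance theorem,

$$p_{ii}^{(n)} = \sum_{k=1}^{n} f_{ii}^{(k)} p_{ii}^{(n-k)}$$

so that

$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \sum_{n=1}^{\infty} \sum_{k=1}^{n} f_{ii}^{(k)} p_{ii}^{(n-k)} = \sum_{k=1}^{\infty} f_{ii}^{(k)} \sum_{n=k}^{\infty} p_{ii}^{(n-k)} = f_{ii} \sum_{n=0}^{\infty} p_{ii}^{(n)}$$

Thus

$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = f_{ii} \left( 1 + \sum_{n=1}^{\infty} p_{ii}^{(n)} \right)$$

Hence, if

$$\sum_{n=1}^{\infty} p_{ii}^{(n)} < \infty$$

then $f_{ii} < 1$, so that $i$ is transient. Now

$$\sum_{n=1}^{N} p_{ii}^{(n)} = \sum_{n=1}^{N} \sum_{k=1}^{n} f_{ii}^{(k)} p_{ii}^{(n-k)} = \sum_{k=1}^{N} f_{ii}^{(k)} \sum_{n=k}^{N} p_{ii}^{(n-k)} \le \sum_{k=1}^{N} f_{ii}^{(k)} \sum_{r=0}^{N} p_{ii}^{(r)}$$

Thus

$$f_{ii} \ge \sum_{k=1}^{N} f_{ii}^{(k)} \ge \frac{\sum_{n=1}^{N} p_{ii}^{(n)}}{\sum_{r=0}^{N} p_{ii}^{(r)}} \to 1 \quad \text{as } N \to \infty \quad \text{if } \sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$$

Therefore $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$ implies that $f_{ii} = 1$, so that $i$ is recurrent.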
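To see the criterion at work numerically, consider the simple random walk on the integers, for which a return to 0 is possible only at even times and $p_{00}^{(2n)} = \binom{2n}{n}(pq)^n$ (a standard formula, assumed here rather than derived in this section). The sketch below sums these terms and uses the identity $S = f_{00}(1 + S)$ from the proof, where $S = \sum_{n \ge 1} p_{00}^{(n)}$. The function name and the truncation point `n_terms` are illustrative choices.

```python
def return_probability_estimate(p, n_terms=5000):
    """Partial sum S_N of p_00^(2n), n = 1..N, for the simple random walk
    with up-probability p, together with S_N / (1 + S_N).

    By the proof of Theorem 3, S_N / (1 + S_N) is a lower bound for f_00
    and converges to f_00 when the full series converges.
    """
    q = 1.0 - p
    term = 1.0      # p_00^(0) = 1
    total = 0.0
    for n in range(1, n_terms + 1):
        # C(2n, n) / C(2n-2, n-1) = (2n)(2n-1) / n^2
        term *= (2 * n) * (2 * n - 1) / (n * n) * p * q
        total += term
    return total, total / (1.0 + total)

s_biased, f_biased = return_probability_estimate(0.6)
s_fair, f_fair = return_probability_estimate(0.5)
```

With $p = 0.6$ the series converges (to $1/|p - q| - 1 = 4$), so the state is transient and the estimate of $f_{00}$ is close to $0.8 = 1 - |p - q|$. With $p = 0.5$ the partial sums keep growing as `n_terms` increases, which is the divergence that Theorem 3 identifies with recurrence.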


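We denote by $f_{ij}$ the probability of ever visiting $j$ at some future time,
starting from $i$; that is,

$$f_{ij} = \sum_{n=1}^{\infty} f_{ij}^{(n)}$$

Theorem 4. If $j$ is a transient state and $i$ an arbitrary state, then

$$\sum_{n=1}^{\infty} p_{ij}^{(n)} < \infty$$

hence

$$p_{ij}^{(n)} \to 0 \quad \text{as } n \to \infty$$

PROOF. By the first entrance theorem,

$$\sum_{n=1}^{\infty} p_{ij}^{(n)} = \sum_{n=1}^{\infty} \sum_{k=1}^{n} f_{ij}^{(k)} p_{jj}^{(n-k)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \sum_{n=k}^{\infty} p_{jj}^{(n-k)} = f_{ij} \sum_{n=0}^{\infty} p_{jj}^{(n)}$$

But $f_{ij}$, being a probability, is $\le 1$, and $\sum_{n=0}^{\infty} p_{jj}^{(n)} < \infty$ by Theorem 3; the
result follows.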


REMARK. If $j$ is a transient state and the initial state of the chain is $i$, then,
by Theorem 4 above and the Borel-Cantelli lemma (Theorem 4,
Section 6.6), with probability 1 the state $j$ will be visited only finitely
many times. Alternatively, we may use the argument of Theorem 2
(with initial state $i$ and $T$ = the time of first visit to $j$) to show that the
probability that $j$ will be visited at least $m$ times is $f_{ij}(f_{jj})^{m-1}$.
Now $f_{jj} < 1$ since $j$ is transient, and thus, if we let $m \to \infty$, we find
(as in the corollary to Theorem 2) that the probability that $j$ will be
visited infinitely often is 0.

In fact this result holds for an arbitrary initial distribution. For

$$P\{R_n = j \text{ for infinitely many } n\} = \sum_i P\{R_0 = i\}\, P\{R_n = j \text{ for infinitely many } n \mid R_0 = i\} = 0$$

It follows that if $B$ is a finite set of transient states, then the probability
of remaining in $B$ forever is 0. For if $R_n \in B$ for all $n$, then,
since $B$ is finite, we have, for some $j \in B$, $R_n = j$ for infinitely many $n$;
thus

$$P\{R_n \in B \text{ for all } n\} \le \sum_{j \in B} P\{R_n = j \text{ for infinitely many } n\} = 0$$


One of our main problems will be to classify the states of a given chain
as to recurrence or nonrecurrence. The first step is to introduce an equivalence
relation on the state space and show that within each equivalence class all
states are of the same type.

DEFINITION. If $i$ and $j$ are distinct states, we say that $i$ leads to $j$ iff $f_{ij} > 0$;
that is, it is possible to reach $j$, starting from $i$. Equivalently, $i$ leads to
$j$ iff $p_{ij}^{(n)} > 0$ for some $n \ge 1$. By convention, $i$ leads to itself. We say
that $i$ and $j$ communicate iff $i$ leads to $j$ and $j$ leads to $i$.

We define an equivalence relation on the state space $S$ by taking $i$ equivalent
to $j$ iff $i$ and $j$ communicate. (It is not difficult to verify that we have a
legitimate equivalence relation.) The next theorem shows that recurrence or
nonrecurrence is a class property: that is, if one state in a given equivalence
class is recurrent, all states are recurrent.


Theorem 5. If $i$ is recurrent and $i$ leads to $j$, then $j$ is recurrent. Furthermore,
$f_{ij} = f_{ji} = 1$. In fact, if $\bar{f}_{ij}$ is the probability that $j$ will be visited
infinitely often when the initial state is $i$, then $\bar{f}_{ij} = \bar{f}_{ji} = 1$.

PROOF. Start in state $i$, and let $T$ be the time of the first visit to $j$. Then

$$1 = \sum_{k=1}^{\infty} P\{T = k\} + P\{T = \infty\}$$

$$= \sum_{k=1}^{\infty} P\{T = k, \text{ infinitely many visits to } i \text{ after } T\} + P\{T = \infty\} \qquad \text{since } i \text{ is recurrent}$$

$$= \sum_{k=1}^{\infty} P\{T = k\}\, P\{R_{T+1}, R_{T+2}, \ldots \text{ visits } i \text{ infinitely often} \mid T = k,\ R_T = j\} + P\{T = \infty\}$$

$$= \sum_{k=1}^{\infty} f_{ij}^{(k)}\, P\{R_1, R_2, \ldots \text{ visits } i \text{ infinitely often} \mid R_0 = j\} + 1 - f_{ij}$$

Thus

$$1 = f_{ij} \bar{f}_{ji} + 1 - f_{ij} \quad \text{or} \quad f_{ij} = f_{ij} \bar{f}_{ji}$$

Since $f_{ij} > 0$ by hypothesis, $\bar{f}_{ji} = 1$.

Now if $p_{ij}^{(r)} > 0$, $p_{ji}^{(s)} > 0$, then

$$p_{jj}^{(n+r+s)} \ge p_{ji}^{(s)} p_{ii}^{(n)} p_{ij}^{(r)}$$

since one way of going from $j$ to $j$ in $n + r + s$ steps is to go from $j$ to $i$
in $s$ steps, from $i$ to $i$ in $n$ steps, and finally from $i$ to $j$ in $r$ steps. It follows from
Theorem 3 that $\sum_{n=1}^{\infty} p_{jj}^{(n)} = \infty$; hence $j$ is recurrent.

Finally, we have $j$ recurrent and $f_{ji} > 0$. By the above argument, with $i$
and $j$ interchanged, $\bar{f}_{ij} = 1$. Since $f_{ij} \ge \bar{f}_{ij}$, $f_{ji} \ge \bar{f}_{ji}$, it follows that $f_{ij} = f_{ji} = 1$ and the theorem is proved.


Theorem 6. In a finite chain (i.e., $S$ a finite set), it is not possible for all
states to be transient.

In particular, if every state in a finite chain can be reached from every other
state, so that there is only one equivalence class (namely $S$), then all states are
recurrent.

PROOF. Suppose that all states are transient. If $S = \{1, 2, \ldots, r\}$, then $\sum_{j=1}^{r} p_{ij}^{(n)} = 1$ for all $n$. Let $n \to \infty$. By
Theorem 4 and the fact that the limit of a finite sum is the sum of the limits,
we have $0 = \sum_{j=1}^{r} \lim_{n \to \infty} p_{ij}^{(n)} = \lim_{n \to \infty} \sum_{j=1}^{r} p_{ij}^{(n)} = 1$, a contradiction.


In the case of a finite chain, it is easy to decide whether or not a given class 
is recurrent; we shall see how to do this in a moment. 


DEFINITION. A nonempty subset $C$ of the state space $S$ is said to be closed
iff it is not possible to leave $C$; that is, $\sum_{j \in C} p_{ij} = 1$ for all $i \in C$.
Notice that if $C$ is closed, then the submatrix $[p_{ij}]$, $i, j \in C$, is stochastic;
hence so is $[p_{ij}^{(n)}]$, $i, j \in C$.

Theorem 7. $C$ is closed iff for all $i \in C$ ($i$ leads to $j$ implies $j \in C$).

PROOF. Let $C$ be closed. If $i \in C$ and $i$ leads to $j$, then $p_{ij}^{(n)} > 0$ for some $n$.
If $j \notin C$, then $\sum_{k \in C} p_{ik}^{(n)} < 1$, a contradiction. Conversely, if the condition is
satisfied and $C$ is not closed, then $\sum_{j \in C} p_{ij} < 1$ for some $i \in C$; hence
$p_{ij} > 0$ for some $i \in C$, $j \notin C$, a contradiction.

Theorem 8.

(a) Let $C$ be a recurrent class. Then $C$ is closed.

(b) If $C$ is any equivalence class, no proper subset of $C$ is closed.

(c) In a finite chain, every closed equivalence class $C$ is recurrent.

Thus, in a finite chain, the recurrent classes are simply those classes that
are closed.

PROOF.

(a) Let $C$ be a recurrent class. If $C$ is not closed, then by Theorem 7 we
have some $i \in C$ leading to a $j \notin C$. But by Theorem 5, $i$ and $j$ communicate,
and so $i$ is equivalent to $j$. This contradicts $i \in C$, $j \notin C$.

(b) Let $D$ be a (nonempty) proper subset of the arbitrary equivalence
class $C$. Pick $i \in D$ and $j \in C$, $j \notin D$. Then $i$ leads to $j$, since both states belong
to the same equivalence class. Thus $D$ cannot be closed.

(c) Consider $C$ itself as a chain; this is possible since $\sum_{j \in C} p_{ij} = 1$, $i \in C$.
(We are simply restricting the original transition matrix to $C$.) By Theorem 6
and the fact that recurrence is a class property, $C$ is recurrent.
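For a finite chain, Theorems 7 and 8 amount to a mechanical procedure: find the equivalence classes from the "leads to" relation, then call a class recurrent exactly when it is closed. A minimal Python sketch follows; the four-state matrix `P` (with states numbered from 0) and the function names are assumptions made for illustration and are not the chain of any figure in the text.

```python
def reachable_from(p, i):
    """Set of states j such that i leads to j (i leads to itself by convention)."""
    seen, stack = {i}, [i]
    while stack:
        k = stack.pop()
        for j, prob in enumerate(p[k]):
            if prob > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def classify(p):
    """Return (class, label) pairs for a finite chain; the label is
    'recurrent' if the class is closed and 'transient' otherwise."""
    n = len(p)
    reach = [reachable_from(p, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        # states that communicate with i
        cls = {j for j in reach[i] if i in reach[j]}
        assigned |= cls
        # Theorem 7: the class is closed iff nothing outside it can be reached
        closed = reach[i] <= cls
        classes.append((sorted(cls), "recurrent" if closed else "transient"))
    return classes

# Hypothetical chain: {0, 1} can leak into {2, 3}, which cannot be left.
P = [[0.5, 0.5, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
result = classify(P)
```

Only the pattern of positive entries matters here, not their numerical values. The closed-means-recurrent step relies on the chain being finite, as in Theorem 8(c).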


> Example 1. Consider the chain of Figure 7.3.1. (An arrow from $i$ to $j$
indicates that $p_{ij} > 0$.) There are three equivalence classes, $C_1 = \{1, 2\}$,
C, = {3,4, 5}, and C; = {6}. By Theorem 8, C, is recurrent and C, and C3 
are transient. 

There is no foolproof method for classifying the states of an infinite chain, 
but in some cases an analysis can be done quickly. Consider the chain of 
Example 3, Section 7.1, and assume that all p, > 0. Then every state is 
reachable from every other state, so that the entire state space forms a
single equivalence class. We claim that the class is recurrent. For assume the
contrary; then all states are transient. Let 0 be the initial state; by the remark
after Theorem 4, the set $B = \{0, 1, \ldots, M - 1\}$ will be visited only finitely
many times; that is, eventually $R_n \ge M$ (with probability 1). But by definition
of the transition matrix, if $R_n \ge M$ then $R_{n+1} = 0$, a contradiction. <

[Figure 7.3.1 A Finite Markov Chain.]


We now describe another basic class property, that of periodicity.

If $p_{ii}^{(n)} > 0$ for some $n \ge 1$, that is, if starting from $i$ it is possible to return
to $i$, we define the period of $i$ (notation: $d_i$) as the greatest common divisor
of the set of positive integers $n$ such that $p_{ii}^{(n)} > 0$. Equivalently, the period
of $i$ is the greatest common divisor of the set of positive integers $n$ such that
$f_{ii}^{(n)} > 0$ (see Problem 1c).


Theorem 9. If the distinct states $i$ and $j$ are in the same equivalence class,
they have the same period.

PROOF. Since $i$ and $j$ communicate, each has a period. If $p_{ij}^{(r)} > 0$,
$p_{ji}^{(s)} > 0$, then $p_{jj}^{(n+r+s)} \ge p_{ji}^{(s)} p_{ii}^{(n)} p_{ij}^{(r)}$ (see the argument of Theorem 5).
Set $n = 0$ to obtain $p_{jj}^{(r+s)} > 0$, so that $r + s$ is a multiple of $d_j$. Thus if $n$
is not a multiple of $d_j$ (so neither is $n + r + s$) we have $p_{jj}^{(n+r+s)} = 0$; hence
$p_{ii}^{(n)} = 0$. But this says that if $p_{ii}^{(n)} > 0$ then $n$ is a multiple of $d_j$; hence $d_j \le d_i$.
By a symmetrical argument, $d_i \le d_j$.


The transitions from state to state within a closed equivalence class $C$ of
period $d > 1$, although random, have a certain cyclic pattern, which we now
describe.

Let $i, j \in C$; if $p_{ij}^{(r)} > 0$ and $p_{ij}^{(s)} > 0$, let $t$ be such that $p_{ji}^{(t)} > 0$. Then
$p_{ii}^{(r+t)} \ge p_{ij}^{(r)} p_{ji}^{(t)} > 0$; hence $d$ divides $r + t$. Similarly, $d$ divides $s + t$, and
so $d$ divides $s - r$.

Thus, if $r = ad + b$, $a$ and $b$ integers, $0 \le b \le d - 1$, then $s = cd + b$
for some integer $c$. Consequently, if $i$ leads to $j$ in $n$ steps, then $n$ is of the form
$ed + b$, that is,

$$n \equiv b \bmod d$$

where the integer $b$ depends on the states $i$ and $j$ but is independent of $n$.


Now fix $i \in C$ and define

$$C_0 = \{j \in C : p_{ij}^{(n)} > 0 \text{ implies } n \equiv 0 \bmod d\}$$

$$C_1 = \{j \in C : p_{ij}^{(n)} > 0 \text{ implies } n \equiv 1 \bmod d\}$$

$$\vdots$$

$$C_{d-1} = \{j \in C : p_{ij}^{(n)} > 0 \text{ implies } n \equiv d - 1 \bmod d\}$$

Then

$$C = \bigcup_{j=0}^{d-1} C_j$$

[Figure 7.3.2 A Periodic Chain.]

Theorem 10. If $k \in C_t$ and $p_{kj} > 0$, then $j \in C_{t+1}$ (with indices reduced
mod $d$; i.e., $C_d = C_0$, $C_{d+1} = C_1$, etc.). Thus, starting from $i$, the chain
moves from $C_0$ to $C_1$ to ... to $C_{d-1}$ back to $C_0$, and so on. The $C_j$ are called the
cyclically moving subclasses of $C$.

PROOF. Choose an $n$ such that $p_{ik}^{(n)} > 0$. Then $n$ is of the form $ad + t$.
Now $p_{ij}^{(n+1)} \ge p_{ik}^{(n)} p_{kj} > 0$; hence $i$ leads to $j$, and therefore $j \in C$ since $C$
is closed. But $n \equiv t \bmod d$; hence $n + 1 \equiv t + 1 \bmod d$, so that $j \in C_{t+1}$.


> Example 2. Consider the chain of Figure 7.3.2. Since every state leads
to every other state, the entire state space forms a closed equivalence class $C$
(necessarily recurrent by Theorem 6). We now describe an effective procedure
that can be used to find the period of any finite closed equivalence class.
Start with any state, say 1, and let $C_0$ be the subclass containing 1. Then all
states reachable in one step from 1 belong to $C_1$; in this case $3 \in C_1$. All
states reachable in one step from 3 belong to $C_2$; in this case $5, 6 \in C_2$.
Continue in this fashion to obtain the following table, constructed according
to the rule that all states reachable in one step from at least one state in
$C_k$ belong to $C_{k+1}$.

$$\begin{array}{ccccccc} C_0 & C_1 & C_2 & C_3 & C_4 & C_5 & C_6 \\ 1 & 3 & 5, 6 & 2 & 4 & 7 & 1, 2 \end{array}$$

Stop the construction when a class $C_k$ is reached that contains a state belonging
to some $C_j$, $j < k$. Here we have $2 \in C_3 \cap C_6$; hence $C_3 = C_6$. Also,
$1 \in C_0 \cap C_6$; hence $C_0 = C_3 = C_6$. Repeat the process with $C_0$ = the class
containing 1 and 2; all states reachable in one step from either 1 or 2 belong
to $C_1$. We obtain

$$\begin{array}{cccc} C_0 & C_1 & C_2 & C_3 \\ 1, 2 & 3, 4 & 5, 6, 7 & 1, 2 \end{array}$$

We find that $C_0 = C_3$ (which we already knew). Since $C_0 \cup C_1 \cup C_2$ is the
entire equivalence class $C$, we are finished. We conclude that the period is 3,
and that $C_0 = \{1, 2\}$, $C_1 = \{3, 4\}$, $C_2 = \{5, 6, 7\}$.

If C has only a finite number of states, the above process must terminate 
in a finite number of steps. 
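The table construction above can be mechanized. The sketch below uses a closely related technique rather than the book's merging procedure: a breadth-first search from one state assigns each state a level, and the period is the greatest common divisor of `level[k] + 1 - level[j]` over all one-step transitions $k \to j$. The successor lists are an assumption chosen to be consistent with the tables of Example 2; the actual arrows of Figure 7.3.2 are not recoverable here and may differ.

```python
from collections import deque
from math import gcd

def period_and_subclasses(successors, start):
    """Period and cyclically moving subclasses of the closed class of start.

    successors[k] lists the states j with p_kj > 0.  The class is assumed
    finite and closed, so it contains a cycle and the period is positive.
    """
    level = {start: 0}
    queue = deque([start])
    d = 0
    while queue:
        k = queue.popleft()
        for j in successors[k]:
            if j not in level:
                level[j] = level[k] + 1
                queue.append(j)
            # two path lengths from start to j must agree mod the period
            d = gcd(d, level[k] + 1 - level[j])
    subclasses = [sorted(j for j in level if level[j] % d == t)
                  for t in range(d)]
    return d, subclasses

# Hypothetical arrows, consistent with the tables of Example 2.
successors = {1: [3], 2: [4], 3: [5, 6], 4: [7], 5: [2], 6: [2], 7: [1, 2]}
d, subclasses = period_and_subclasses(successors, 1)
```

For these arrows the search gives period 3 with subclasses $\{1, 2\}$, $\{3, 4\}$, $\{5, 6, 7\}$, matching the conclusion of the example.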

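We have the following schematic representation of the powers of the
transition matrix ($x$ stands for positive element).

$$\Pi = \begin{array}{c|ccc} & C_0 & C_1 & C_2 \\ \hline C_0 & 0 & x & 0 \\ C_1 & 0 & 0 & x \\ C_2 & x & 0 & 0 \end{array} \qquad \Pi^2 = \begin{array}{c|ccc} & C_0 & C_1 & C_2 \\ \hline C_0 & 0 & 0 & x \\ C_1 & x & 0 & 0 \\ C_2 & 0 & x & 0 \end{array} \qquad \Pi^3 = \begin{array}{c|ccc} & C_0 & C_1 & C_2 \\ \hline C_0 & x & 0 & 0 \\ C_1 & 0 & x & 0 \\ C_2 & 0 & 0 & x \end{array}$$

Notice that $\Pi^4$ has the same form as $\Pi$ but is not the same numerically;
similarly, $\Pi^5$ has the same form as $\Pi^2$, $\Pi^6$ has the same form as $\Pi^3$, and
so on. <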


> Example 3. Consider the simple random walk.

(a) If there are no barriers, the entire state space forms a closed equivalence
class with period 2. We have seen that $f_{00} = 1 - |p - q|$ [see (6.2.7)];
by symmetry, $f_{ii} = f_{00}$ for all $i$. Thus if $p = q$ the class is recurrent, and
if $p \ne q$ the class is transient.

(b) If there is an absorbing barrier at 0, then there are two classes, $C = \{0\}$
and $D = \{1, 2, \ldots\}$. $C$ is clearly recurrent, and since $D$ is not closed,
it is transient by Theorem 8. $C$ has period 1, and $D$ has period 2.

(c) If there are absorbing barriers at 0 and $b$, then there are three equivalence
classes, $C = \{0\}$, $D = \{1, 2, \ldots, b - 1\}$, $E = \{b\}$. $C$ and $E$
have period 1 and are recurrent; $D$ has period 2 and is transient. <


REMARK. We have seen that if $B$ is a finite set of transient states, the
probability of remaining forever in $B$ is 0. This is not true for an
infinite set of transient states. For example, in the simple random walk
with an absorbing barrier at 0, if the initial state is $x \ge 1$ and $p > q$,
there is a probability $1 - (q/p)^x > 0$ of remaining forever in the
transient class $D = \{1, 2, \ldots\}$ [see (6.2.6)].

TERMINOLOGY. A state (or class) is said to be aperiodic iff its period $d$ is 1,
periodic iff $d > 1$.


PROBLEMS 


1. (a) Let $A$ be a (possibly infinite) set of positive integers with greatest common
divisor $d$. Show that there is a finite subset of $A$ with greatest common
divisor $d$.

(b) If $A$ is a nonempty set of positive integers with greatest common divisor $d$,
and $A$ is closed under addition, show that all sufficiently large multiples
of $d$ belong to $A$.

(c) If $d_i$ is the period of the state $i$, show that $d_i$ is the greatest common divisor
of $\{n \ge 1 : f_{ii}^{(n)} > 0\}$.

2. A state $i$ is said to be essential iff its equivalence class is closed. Show that $i$ is
essential iff, whenever $i$ leads to $j$, it follows that $j$ leads to $i$.

3. Prove directly (without using Theorem 5) that an equivalence class that is not
closed must be transient.


4. Classify the states of the following Markov chains. [In (a) and (b) assume $0 < p < 1$.]

(a) Simple random walk with reflecting barrier at 0 ($S = \{1, 2, \ldots\}$, $p_{11} = q$,
$p_{i,i+1} = p$ for all $i$, $p_{i,i-1} = q$, $i = 2, 3, \ldots$)

(b) Simple random walk with reflecting barriers at 0 and $l + 1$ ($S = \{1, 2, \ldots, l\}$,
$p_{11} = q$, $p_{ll} = p$, $p_{i,i+1} = p$, $i = 1, 2, \ldots, l - 1$, $p_{i,i-1} = q$, $i = 2, 3, \ldots, l$).


(c)

$$\Pi = \begin{bmatrix} .2 & .8 & 0 & 0 \\ 0 & 0 & .1 & .9 \\ 0 & 0 & .2 & .8 \\ .7 & .3 & 0 & 0 \end{bmatrix}$$
(d) 
zs 3 O 
1l=|]0 3 3 
05 3 
5. Let $R_0, R_1, R_2, \ldots$ be independent random variables, all having the same
distribution function, with values in the countable set $S$; assume $P\{R_i = j\} > 0$
for all $j \in S$.

(a) Show that $\{R_n\}$ may be regarded as a Markov chain; what is $\Pi$?
(b) Classify the states of the chain.


6. Let $i$ be a state of a Markov chain, and let

$$F(z) = \sum_{n=0}^{\infty} f_{ii}^{(n)} z^n, \qquad U(z) = \sum_{n=0}^{\infty} p_{ii}^{(n)} z^n, \qquad |z| < 1$$

[take $f_{ii}^{(0)} = 0$]. Use the first entrance theorem to show that $U(z) - 1 = F(z)U(z)$.

7.4 LIMITING PROBABILITIES 


In this section we investigate the limiting behavior of the $n$-step transition
probability $p_{ij}^{(n)}$. The basic result is the following.

Theorem 1. Let $f_1, f_2, \ldots$ be a sequence of nonnegative numbers with
$\sum_{n=1}^{\infty} f_n = 1$, such that the greatest common divisor of $\{j : f_j > 0\}$ is 1. Set
$u_0 = 1$, $u_n = \sum_{k=1}^{n} f_k u_{n-k}$, $n = 1, 2, \ldots$. Define $\mu = \sum_{n=1}^{\infty} n f_n$. Then

$$u_n \to 1/\mu \quad \text{as } n \to \infty$$


We shall apply the theorem to a Markov chain with $f_n = f_{ii}^{(n)}$, $i$ a given
recurrent state with period 1; then $u_n = p_{ii}^{(n)}$ by the first entrance theorem.
Also, $\mu = \mu_i = \sum_{n=1}^{\infty} n f_{ii}^{(n)}$, so that if $T$ is the time required to return to $i$
when the initial state is $i$, then $\mu_i = E(T)$. If $i$ is an arbitrary recurrent state
of a Markov chain, $\mu_i$ is called the mean recurrence time of $i$.

Theorem 1 states that $p_{ii}^{(n)} \to 1/\mu_i$; thus, starting in $i$, there is a limiting
probability for state $i$, namely, the reciprocal of the mean recurrence time.
Intuitively, if $\mu_i$ = (say) 4, then for large $n$ we should be in state $i$ roughly
one quarter of the time, and it is reasonable to expect that $p_{ii}^{(n)} \to 1/4$.
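The recursion in Theorem 1 is easy to run numerically, which gives a quick check of the claimed limit. The following Python sketch computes $u_n$ for an illustrative choice of the $f_k$ (the particular values, the dictionary representation, and the function name are assumptions made here), and also shows what goes wrong when the greatest common divisor condition fails.

```python
def renewal_sequence(f, n_max):
    """u_0 = 1, u_n = sum_{k=1}^n f_k u_{n-k}.

    f maps k to f_k; any k missing from f is treated as f_k = 0.
    """
    u = [1.0]
    for n in range(1, n_max + 1):
        u.append(sum(f.get(k, 0.0) * u[n - k] for k in range(1, n + 1)))
    return u

# Illustrative aperiodic case: gcd{1, 2} = 1.
f = {1: 0.5, 2: 0.5}
mu = sum(k * fk for k, fk in f.items())     # mean = 1.5
u = renewal_sequence(f, 60)

# Illustrative periodic case: gcd{2} = 2, so the theorem does not apply.
u_periodic = renewal_sequence({2: 1.0}, 11)
```

In the first case $u_n$ settles quickly near $1/\mu = 2/3$. In the second case $u_n$ alternates between 1 and 0 and has no limit, which is why the hypothesis on the greatest common divisor cannot be dropped.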


PROOF. We first list three results from real analysis that will be needed.
All numbers $a_{kj}$, $c_j$, $a_j$, $b_j$ are assumed real.

1. Fatou's Lemma: If $|a_{kj}| \le c_j$, $k, j = 1, 2, \ldots$, and $\sum_j c_j < \infty$, then

$$\limsup_{k \to \infty} \sum_j a_{kj} \le \sum_j \limsup_{k \to \infty} a_{kj}$$

and

$$\liminf_{k \to \infty} \sum_j a_{kj} \ge \sum_j \liminf_{k \to \infty} a_{kj}$$

The “lim inf” statement holds without the hypothesis that
$|a_{kj}| \le c_j$, $\sum_j c_j < \infty$, if all $a_{kj} \ge 0$.

2. Dominated Convergence Theorem: If $|a_{kj}| \le c_j$, $k, j = 1, 2, \ldots$,
$\sum_j c_j < \infty$, and $\lim_{k \to \infty} a_{kj} = a_j$, $j = 1, 2, \ldots$, then

$$\lim_{k \to \infty} \sum_j a_{kj} = \sum_j \lim_{k \to \infty} a_{kj} = \sum_j a_j$$

(The dominated convergence theorem follows from Fatou's lemma. Alternatively,
a fairly short direct proof may be given.)

3. $\liminf_{j \to \infty} (a_j + b_j) \le \liminf_{j \to \infty} a_j + \limsup_{j \to \infty} b_j$.
(This follows quickly from the definitions of lim inf and lim sup.)


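We now prove Theorem 1. First notice that $0 \le u_n \le 1$ for all $n$ (by
induction). Define

$$r_n = \sum_{j=n+1}^{\infty} f_j, \qquad n = 0, 1, \ldots$$

Then

$$u_n = f_1 u_{n-1} + \cdots + f_n u_0 = (r_0 - r_1) u_{n-1} + \cdots + (r_{n-1} - r_n) u_0, \qquad n \ge 1 \tag{7.4.1}$$

Since $r_0 = \sum_{j=1}^{\infty} f_j = 1$, we have $u_n = r_0 u_n$, and thus we may rearrange terms
in (7.4.1) to obtain

$$r_0 u_n + r_1 u_{n-1} + \cdots + r_n u_0 = r_0 u_{n-1} + r_1 u_{n-2} + \cdots + r_{n-1} u_0, \qquad n \ge 1$$

This indicates that $\sum_{k=0}^{n} r_k u_{n-k}$ is independent of $n$; hence

$$\sum_{k=0}^{n} r_k u_{n-k} = r_0 u_0 = 1, \qquad n = 0, 1, \ldots \tag{7.4.2}$$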


[An alternative proof that $\sum_{k=0}^{n} r_k u_{n-k} = 1$: construct a Markov chain with
$f_{ii}^{(n)} = f_n$, $p_{ii}^{(n)} = u_n$ (see Problem 1). Then

$$\sum_{k=0}^{n} r_k u_{n-k} = \sum_{k=0}^{n} r_{n-k} u_k = \sum_{k=0}^{n} p_{ii}^{(k)} P\{T > n - k \mid R_0 = i\}$$

(where $T$ is the time required to return to $i$ when the initial state is $i$)

$$= \sum_{k=0}^{n} P\{R_k = i,\ R_{k+1} \ne i, \ldots, R_n \ne i \mid R_0 = i\}$$

$$= P\{R_k = i \text{ for some } k = 0, 1, \ldots, n \mid R_0 = i\} = 1]$$

Now let $b = \limsup_n u_n$. Pick a subsequence $\{u_{n_k}\}$ converging to $b$. Then

$$b = \lim_k u_{n_k} = \liminf_k u_{n_k} = \liminf_k \left[ f_i u_{n_k - i} + \sum_{j=1,\, j \ne i}^{n_k} f_j u_{n_k - j} \right]$$

$$\le \liminf_k \left( f_i u_{n_k - i} \right) + \left( \sum_{j=1,\, j \ne i}^{\infty} f_j \right) b$$

[We use here the fact that $\liminf_k (a_k + b_k) \le \liminf_k a_k + \limsup_k b_k$.
Furthermore, if we take $u_n = 0$ for $n < 0$, then

$$\limsup_k \sum_{j \ne i} f_j u_{n_k - j} \le \sum_{j \ne i} f_j \limsup_k \left( u_{n_k - j} \right) \le b \sum_{j \ne i} f_j$$

Notice that since $|f_j u_{n_k - j}| \le f_j$ and $\sum_j f_j < \infty$, Fatou's lemma applies.]
Therefore

$$b \le f_i \liminf_k u_{n_k - i} + (1 - f_i) b$$

or

$$f_i \liminf_k u_{n_k - i} \ge f_i b$$

Thus $f_i > 0$ implies $u_{n_k - i} \to b$ as $k \to \infty$.

It follows that $u_{n_k - i} \to b$ for sufficiently large $i$. For if $f_i > 0$, we apply the
above argument to the sequence $u_{n_k - i}$, $k = 1, 2, \ldots$, to show that $f_j > 0$
implies $u_{n_k - i - j} \to b$. Thus if $t = \sum_r a_r i_r$ (a finite sum), where the $a_r$ are positive integers
and $f_{i_r} > 0$, then $u_{n_k - t} \to b$. The set $S$ of all such $t$'s is closed under addition
and has greatest common divisor 1, since $S$ is generated by the positive integers
$i$ for which $f_i > 0$. Thus (Problem 1b, Section 7.3) $S$ contains all sufficiently
large positive integers. Say $u_{n_k - i} \to b$ for $i \ge I$. By (7.4.2),

$$\sum_{j=0}^{n_k - I} r_j u_{n_k - I - j} = 1 \quad (\text{with } u_n = 0 \text{ for } n < 0) \tag{7.4.3}$$

If $\sum_{j=0}^{\infty} r_j < \infty$, the dominated convergence theorem shows that we may let
$k \to \infty$ and take limits term by term in (7.4.3) to obtain $b \sum_{j=0}^{\infty} r_j = 1$. If
$\sum_{j=0}^{\infty} r_j = \infty$, Fatou's lemma gives $1 \ge b \sum_{j=0}^{\infty} r_j$; hence $b = 0$. In either
case, then,

$$b = \left[ \sum_{j=0}^{\infty} r_j \right]^{-1}$$

But

$$r_0 = f_1 + f_2 + f_3 + \cdots$$
$$r_1 = f_2 + f_3 + \cdots$$
$$r_2 = f_3 + \cdots$$
$$\vdots$$

Hence

$$\sum_{n=0}^{\infty} r_n = \sum_{n=1}^{\infty} n f_n = \mu$$

Consequently $b = \limsup_n u_n = 1/\mu$. By an entirely symmetric argument,
$\liminf_n u_n = 1/\mu$, and the result follows.
We now apply Theorem 1 to gain complete information about the limiting
behavior of the $n$-step transition probability $p_{ij}^{(n)}$. A recurrent state $i$ is said to
be positive iff its mean recurrence time $\mu_i$ is $< \infty$, null iff $\mu_i = \infty$.

Theorem 2.
(a) If the state $j$ is transient then $\sum_{n=1}^{\infty} p_{ij}^{(n)} < \infty$ for all $i$, hence
$p_{ij}^{(n)} \to 0$ as $n \to \infty$.

PROOF. This is Theorem 4 of Section 7.3.
(b) If $j$ is recurrent and aperiodic, and $i$ belongs to the same equivalence
class as $j$, then $p_{ij}^{(n)} \to 1/\mu_j$. Furthermore, $\mu_j$ is finite iff $\mu_i$ is finite. If $i$
belongs to a different class, then $p_{ij}^{(n)} \to f_{ij}/\mu_j$.

PROOF.

$$p_{ij}^{(n)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} p_{jj}^{(n-k)}$$

by the first entrance theorem [take $p_{jj}^{(r)} = 0$, $r < 0$]. By the dominated convergence
theorem, we may take limits term by term as $n \to \infty$; since $p_{jj}^{(n-k)} \to 1/\mu_j$
by Theorem 1, we have

$$p_{ij}^{(n)} \to \left( \sum_{k=1}^{\infty} f_{ij}^{(k)} \right) \frac{1}{\mu_j} = \frac{f_{ij}}{\mu_j}$$

If $i$ and $j$ belong to the same recurrent class, $f_{ij} = 1$.

Now assume that $\mu_j$ is finite. If $p_{ij}^{(r)}, p_{ji}^{(s)} > 0$, then $p_{ii}^{(n+r+s)} \ge p_{ij}^{(r)} p_{jj}^{(n)} p_{ji}^{(s)}$;
this is bounded away from 0 for large $n$, since $p_{jj}^{(n)} \to 1/\mu_j > 0$. But if $\mu_i = \infty$,
then $p_{ii}^{(n)} \to 0$ as $n \to \infty$, a contradiction. This proves (b).

(c) Let $j$ be recurrent with period $d > 1$. Let $i$ be in the same class as $j$,
with $i \in$ the cyclically moving subclass $C_r$, $j \in C_{r+a}$. Then $p_{ij}^{(nd+a)} \to d/\mu_j$.
Also, $\mu_j$ is finite iff $\mu_i$ is finite, so that the property of being recurrent
positive (or recurrent null) is a class property.

PROOF. First assume $a = 0$. Then $j$ is recurrent and aperiodic relative to
the chain with transition matrix $\Pi^d$. (If $A$ has greatest common divisor $d$, then
the greatest common divisor of $\{x/d : x \in A\}$ is 1.) By (b),

$$p_{ij}^{(nd)} \to \frac{1}{\sum_{k=1}^{\infty} k f_{jj}^{(kd)}} = \frac{d}{\sum_{k=1}^{\infty} kd\, f_{jj}^{(kd)}} = \frac{d}{\mu_j}$$

Now, having established the result for $a = r$, assume $a = r + 1$ and
write

$$p_{ij}^{(nd+r+1)} = \sum_k p_{ik}\, p_{kj}^{(nd+r)} \to \sum_k p_{ik} \frac{d}{\mu_j} = \frac{d}{\mu_j}$$

as asserted.

The argument that $\mu_j$ is finite iff $\mu_i$ is finite is the same as in (b), with $nd$
replacing $n$.
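(d) If $j$ is recurrent with period $d > 1$, and $i$ is an arbitrary state, then

$$p_{ij}^{(nd+a)} \to \frac{d}{\mu_j} \left[ \sum_{k=0}^{\infty} f_{ij}^{(kd+a)} \right], \qquad a = 0, 1, \ldots, d - 1$$

The expression in brackets is the probability of reaching $j$ from $i$ in
a number of steps that is $\equiv a \bmod d$. Thus, if $j$ is recurrent null, $p_{ij}^{(n)} \to 0$
as $n \to \infty$ for all $i$.

PROOF.

$$p_{ij}^{(nd+a)} = \sum_{k=1}^{nd+a} f_{ij}^{(k)} p_{jj}^{(nd+a-k)}, \qquad a = 0, 1, \ldots, d - 1$$

Since $j$ has period $d$, $p_{jj}^{(nd+a-k)} = 0$ unless $k - a$ is of the form $rd$ (necessarily
$r \le n$); hence

$$p_{ij}^{(nd+a)} = \sum_{r=0}^{n} f_{ij}^{(rd+a)} p_{jj}^{((n-r)d)}$$

Let $n \to \infty$ and use (c) to finish the proof.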


(e) A finite chain has no recurrent null states.

PROOF. Let $C$ be a finite recurrent null class, say, $C = \{1, 2, \ldots, r\}$. Then

$$\sum_{j=1}^{r} p_{ij}^{(n)} = 1, \qquad i \in C$$

Let $n \to \infty$; by (d) we obtain $0 = 1$, a contradiction.


PROBLEMS 


1. With $f_n$ and $u_n$ as in Theorem 1, show how to construct a Markov chain with
a state $i$ such that $f_{ii}^{(n)} = f_n$ and $p_{ii}^{(n)} = u_n$ for all $n$.

2. (The renewal theorem) Let $T_1, T_2, \ldots$ be independent random variables, all
with the same distribution function, taking values on the positive integers.
(Think of the $T_i$ as waiting times for customers to arrive, or as lifetimes of a
succession of products such as light bulbs. If $T_1 + \cdots + T_n = x$, bulb $n$ has
burned out at time $x$, and the light must be renewed by placing bulb $n + 1$ in
position.) Assume that the greatest common divisor of $\{x : P\{T_k = x\} > 0\}$ is $d$
and let $G(n) = \sum_{k=1}^{\infty} P\{T_1 + \cdots + T_k = n\}$, $n = 1, 2, \ldots$. If $\mu = E(T_i)$, show
that $\lim_{n \to \infty} G(nd) = d/\mu$; interpret the result intuitively.

3. Show that in any Markov chain, $(1/n) \sum_{k=1}^{n} p_{ij}^{(k)}$ approaches a limit as $n \to \infty$,
namely, $f_{ij}/\mu_j$. (Define $\mu_j = \infty$ if $j$ is transient.) HINT:


1 2 dj 
~>pp=>S- SF pw, d= period of 


4. Let $V_{ij}$ be the number of visits to the state $j$, starting at $i$. (If $i = j$, $t = 0$ counts
as a visit.)

(a) Show that $E(V_{ij}) = \sum_{n=0}^{\infty} p_{ij}^{(n)}$. Thus $i$ is recurrent iff $E(V_{ii}) = \infty$, and if $j$ is
transient, $E(V_{ij}) < \infty$ for all $i$.

(b) Let $C$ be a transient class, $N_{ij} = E(V_{ij})$, $i, j \in C$. Show that

$$N_{ij} = \delta_{ij} + \sum_{k \in C} p_{ik} N_{kj} \qquad (\delta_{ij} = 1,\ i = j;\ = 0,\ i \ne j)$$

In matrix form, $N = I + QN$, $Q = \Pi$ restricted to $C$.

(c) Show that $(I - Q)N = N(I - Q) = I$ so that $N = (I - Q)^{-1}$ in the case
of a finite chain (the inverse of an infinite matrix need not be unique).

REMARK. (a) implies that in the gambler's ruin problem with finite capital, the
average duration of the game is finite. For if the initial capital is $i$ and $D$ is
the duration of the game, then $D = \sum_{j=1}^{b-1} V_{ij}$, so that $E(D) < \infty$.

7.5 STATIONARY AND STEADY-STATE DISTRIBUTIONS


A stationary distribution for a Markov chain with state space $S$ is a set of
numbers $v_i$, $i \in S$, such that $v_i \ge 0$, $\sum_{i \in S} v_i = 1$, and

$$\sum_{i \in S} v_i p_{ij} = v_j, \qquad j \in S$$

Thus, if $V = (v_i, i \in S)$, then $V\Pi = V$. By induction, $V\Pi^n = V\Pi(\Pi^{n-1}) = V\Pi^{n-1} = \cdots = V\Pi = V$, so that $V\Pi^n = V$ for all $n = 0, 1, \ldots$. Therefore,
if the initial state distribution is $V$, the state distribution at all future times is
still $V$. Furthermore, since

$$P\{R_n = i,\ R_{n+1} = i_1, \ldots, R_{n+k} = i_k\} = P\{R_n = i\}\, p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k} \quad \text{(Problem 4, Section 7.1)}$$

$$= v_i\, p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k}$$

the sequence $\{R_n\}$ is stationary; that is, the joint probability function of
$R_n, R_{n+1}, \ldots, R_{n+k}$ does not depend on $n$.

Stationary distributions are closely related to limiting probabilities. The 
main result is the following. 


Theorem 1. Consider a Markov chain with transition matrix $[p_{ij}]$. Assume

$$\lim_{n \to \infty} p_{ij}^{(n)} = q_j$$

for all states $i, j$ (where $q_j$ does not depend on $i$). Then

(a) $\sum_{j \in S} q_j \le 1$ and $\sum_{i \in S} q_i p_{ij} = q_j$, $j \in S$.

(b) Either all $q_j = 0$, or else $\sum_{j \in S} q_j = 1$.

(c) If all $q_j = 0$, there is no stationary distribution. If $\sum_{j \in S} q_j = 1$, then
$\{q_j\}$ is the unique stationary distribution.


PROOF.

$$\sum_j q_j = \sum_j \lim_n p_{ij}^{(n)} \le \liminf_n \sum_j p_{ij}^{(n)} = 1$$

by Fatou's lemma; hence $\sum_j q_j \le 1$.

Now

$$\sum_i q_i p_{ij} = \sum_i \left( \lim_n p_{ki}^{(n)} \right) p_{ij} \le \liminf_n \sum_i p_{ki}^{(n)} p_{ij} = \liminf_n p_{kj}^{(n+1)} = q_j$$

But if $\sum_i q_i p_{i j_0} < q_{j_0}$ for some $j_0$, then

$$\sum_j q_j > \sum_j \sum_i q_i p_{ij} = \sum_i q_i \sum_j p_{ij} = \sum_i q_i$$

which is a contradiction. This proves (a).

Now if $Q = (q_i, i \in S)$, then by (a), $Q\Pi = Q$; hence, by induction,
$Q\Pi^n = Q$, that is, $\sum_i q_i p_{ij}^{(n)} = q_j$. Thus

$$q_j = \lim_n \sum_i q_i p_{ij}^{(n)} = \sum_i q_i \lim_n p_{ij}^{(n)}$$

by the dominated convergence theorem. Hence $q_j = \left( \sum_i q_i \right) q_j$, proving (b).

Finally, if $\{v_i\}$ is a stationary distribution, then $\sum_i v_i p_{ij}^{(n)} = v_j$. Let $n \to \infty$
to obtain $\sum_i v_i q_j = v_j$, so that $q_j = v_j$. Consequently, if a stationary distribution
exists, it is unique and coincides with $\{q_j\}$. Therefore no stationary
distribution can exist if all $q_j = 0$; if $\sum_j q_j = 1$, then, by (a), $\{q_j\}$ is stationary
and the result is established.


The numbers $v_i$, $i \in S$, are said to form a steady-state distribution iff $\lim_{n \to \infty} p_{ij}^{(n)} = v_j$ for all $i, j \in S$, and $\sum_{j \in S} v_j = 1$. Thus we require that limiting probabilities
exist (independent of the initial state) and form a probability
distribution.

In the case of a finite chain, a set of limiting probabilities that are independent
of the initial state must form a steady-state distribution, that is,
the case in which all $q_j = 0$ cannot occur in Theorem 1. For $\sum_{j \in S} p_{ij}^{(n)} = 1$
for all $i \in S$; let $n \to \infty$ to obtain, since $S$ is finite, $\sum_{j \in S} q_j = 1$. If the chain
is infinite, this result is no longer valid. For example, if all states are transient,
then $p_{ij}^{(n)} \to 0$ for all $i, j$.

If $\{q_j\}$ is a steady-state distribution, $\{q_j\}$ is the unique stationary distribution,
by Theorem 1. However, a chain can have a unique stationary distribution
without having a steady-state distribution, in fact without having limiting
probabilities. We give examples later in the section.

We shall establish conditions under which a steady-state distribution exists 
after we discuss the existence and uniqueness of stationary distributions. 

Let N be the number of positive recurrent classes. 


CASE 1. $N = 0$. Then all states are transient or recurrent null. Hence
$p_{ij}^{(n)} \to 0$ for all $i, j$ by Theorem 2 of Section 7.4, so that, by Theorem 1
of this section, there is no stationary distribution.

CASE 2. $N = 1$. Let $C$ be the unique positive recurrent class. If $C$ is aperiodic,
then, by Theorem 2 of Section 7.4, $p_{ij}^{(n)} \to 1/\mu_j$, $i, j \in C$. If $j \notin C$, then
$j$ is transient or recurrent null, so that $p_{ij}^{(n)} \to 0$ for all $i$. By Theorem 1,
if we assign $v_j = 1/\mu_j$, $j \in C$, $v_j = 0$, $j \notin C$, then $\{v_j\}$ is the unique
stationary distribution, and $p_{ij}^{(n)} \to v_j$ for all $i, j$.

Now assume $C$ periodic, with period $d > 1$. Let $D$ be a cyclically
moving subclass of $C$. The states of $D$ are recurrent and aperiodic
relative to the transition matrix $\Pi^d$. By Theorem 2 of Section 7.4,
$p_{ij}^{(nd)} \to d/\mu_j$, $i, j \in D$; hence $\{d/\mu_j, j \in D\}$ is the unique stationary
distribution for $D$ relative to $\Pi^d$ (in particular, $\sum_{j \in D} 1/\mu_j = 1/d$). It
follows that $v_j = 1/\mu_j$, $j \in C$, $v_j = 0$, $j \notin C$, gives the unique stationary
distribution for the original chain (see Problem 1).


CASE 3. $N \ge 2$. There is a unique stationary distribution for each positive
recurrent class, hence uncountably many stationary distributions for
the original chain. For if $V_1 \Pi = V_1$, $V_2 \Pi = V_2$, then, if $a_1, a_2 \ge 0$,
$a_1 + a_2 = 1$, we have

$$(a_1 V_1 + a_2 V_2)\Pi = a_1 V_1 + a_2 V_2$$

In summary, there is a unique stationary distribution if and only if
there is exactly one positive recurrent class.
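Case 3 can be checked on a small example. The sketch below uses a hypothetical three-state chain (not taken from the text) in which states 0 and 2 are absorbing, so that there are two positive recurrent classes, and verifies that every convex combination of the two corresponding stationary distributions is again stationary. The names `step` and `mix` are illustrative.

```python
from fractions import Fraction as F

def step(v, p):
    """One step of the chain applied to a row vector: returns v * P."""
    n = len(p)
    return [sum(v[i] * p[i][j] for i in range(n)) for j in range(n)]

# Hypothetical chain: states 0 and 2 are absorbing, state 1 is transient.
P = [[F(1), F(0), F(0)],
     [F(1, 2), F(0), F(1, 2)],
     [F(0), F(0), F(1)]]

V1 = [F(1), F(0), F(0)]     # stationary distribution carried by class {0}
V2 = [F(0), F(0), F(1)]     # stationary distribution carried by class {2}

def mix(a1):
    """Convex combination a1 * V1 + (1 - a1) * V2."""
    return [a1 * x + (1 - a1) * y for x, y in zip(V1, V2)]

mixed = mix(F(1, 3))
```

Since `a1` may be any number in $[0, 1]$, this already exhibits uncountably many stationary distributions for the chain.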


Finally, we have the basic theorem concerning steady-state distributions. 


Theorem 2.

(a) If there is a steady-state distribution, there is exactly one positive
recurrent class $C$, and this class is aperiodic; also, $f_{ij} = 1$ for all $j \in C$
and all $i \in S$.

(b) Conversely, if there is exactly one positive recurrent class $C$, which is
aperiodic, and, in addition, $f_{ij} = 1$ for all $j \in C$ and all $i \in S$, then
a steady-state distribution exists.

PROOF.

(a) Let $\{v_j\}$ be a steady-state distribution. By Theorem 1, $\{v_j\}$ is the unique
stationary distribution; hence there must be exactly one positive
recurrent class $C$. Suppose that $C$ has period $d > 1$, and let $i \in$ a
cyclically moving subclass $C_0$, $j \in C_1$. Then $p_{ij}^{(nd+1)} \to d/\mu_j$ by Theorem 2
of Section 7.4, and $p_{ij}^{(nd)} = 0$ for all $n$. Since $d/\mu_j > 0$, $p_{ij}^{(n)}$ has no limit
as $n \to \infty$, contradicting the hypothesis. If $j \in C$ and $i \in S$, then by
Theorem 2(b) of Section 7.4, $p_{ij}^{(n)} \to f_{ij}/\mu_j$, hence $v_j = f_{ij}/\mu_j$. Since
$v_j$ does not depend on $i$, we have $f_{ij} = f_{jj} = 1$.

(b) By Theorem 2(b) of Section 7.4,

$$p_{ij}^{(n)} \to \frac{f_{ij}}{\mu_j} \quad \text{for all } i, \text{ if } j \in C$$

$$p_{ij}^{(n)} \to 0 \quad \text{for all } i, \text{ if } j \notin C$$

since in this case $j$ is transient or recurrent null. Therefore, if $f_{ij} = 1$
for all $i \in S$ and $j \in C$, the limit $v_j$ is independent of $i$. Since $C$ is
positive, $v_j > 0$ for $j \in C$; hence, by Theorem 1, $\sum_j v_j = 1$ and the
result follows.

[Note that if a steady-state distribution exists, there are no recurrent null
classes (or closed transient classes). For if $D$ is such a class and $i \in D$, then
since $D$ is closed, $f_{ij} = 0$ for all $j \in C$, a contradiction. Thus in Theorem 2,
the statement “there is exactly one positive recurrent class, which is aperiodic”
may be replaced by “there is exactly one recurrent class, which is positive and
aperiodic”.]


COROLLARY. Consider a finite chain. 

(a) A steady-state distribution exists iff there is exactly one closed equiva- 
lence class C, and C is aperiodic. 

(b) There is a unique stationary distribution iff there is exactly one closed 
equivalence class. 


PROOF. The result follows from Theorem 2, with the aid of Theorem 8c 
of Section 7.3, Theorem 2e of Section 7.4, and the fact that if B is a finite 
set of transient states, the probability of remaining forever in B is 0 (see the 
remark after Theorem 4 of Section 7.3). 

[It is not difficult to verify that a finite chain has at least one closed 
equivalence class. Thus a finite chain always has at least one stationary 
distribution.] 


REMARK. Consider a finite chain with exactly one closed equivalence class, 
which is periodic. Then, by the above corollary, there is a unique 
stationary distribution but no steady-state distribution, in fact no 
limiting probabilities (see the argument of Theorem 2a). For example, 
consider the chain with transition matrix 


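\[ \Pi = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \]

The unique stationary distribution is (1/2, 1/2), but 

\[ \Pi^n = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad n \text{ even} \]

\[ \Pi^n = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad n \text{ odd} \]

and therefore $\Pi^n$ does not approach a limit. 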
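
The alternation described in the remark can be checked numerically. The following Python sketch raises the two-state transition matrix of the remark to successive powers; the helper names `mat_mul` and `mat_pow` are illustrative choices and do not come from the text.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return the n-th power of m (n >= 0) by repeated multiplication."""
    size = len(m)
    # Start from the identity matrix, which is the 0-th power.
    result = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Transition matrix of the periodic two-state chain in the remark.
PI = [[0, 1], [1, 0]]

if __name__ == "__main__":
    for n in range(1, 5):
        print(n, mat_pow(PI, n))
```

Even powers give the identity matrix and odd powers give the original matrix, so the n-step transition probabilities oscillate between 0 and 1 and have no limit.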


Usually the easiest way to find a steady-state distribution $\{v_j\}$, if it exists, 
is to use the fact that a steady-state distribution must be the unique stationary 
distribution. Thus we solve the equations 

\[ \sum_{i \in S} v_i p_{ij} = v_j, \quad j \in S \]

under the conditions that all $v_j \ge 0$ and $\sum_{j \in S} v_j = 1$. 
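
For a finite chain the stationary equations form a small linear system. The sketch below is one way to solve it exactly in Python, assuming the chain has a unique stationary distribution (exactly one closed equivalence class). The function name `stationary_distribution` and the three-state matrix `P3` are hypothetical illustrations, not taken from the text. One balance equation is redundant, so it is replaced by the normalization condition.

```python
from fractions import Fraction


def stationary_distribution(p):
    """Solve sum_i v_i p[i][j] = v_j, sum_j v_j = 1 by Gauss-Jordan elimination.

    p is a square transition matrix (list of rows). Raises ValueError when
    the system is singular, i.e. the stationary distribution is not unique.
    """
    n = len(p)
    rows = []
    # Balance equations for j = 0..n-2: sum_i v_i (p[i][j] - delta_ij) = 0.
    for j in range(n - 1):
        coeffs = [Fraction(p[i][j]) - (1 if i == j else 0) for i in range(n)]
        rows.append(coeffs + [Fraction(0)])
    # Normalization replaces the (redundant) last balance equation.
    rows.append([Fraction(1)] * n + [Fraction(1)])

    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("stationary distribution is not unique")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pv = rows[col][col]
        rows[col] = [x / pv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


# Hypothetical three-state example (not from the text).
P3 = [
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
    [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)],
    [Fraction(0), Fraction(1, 2), Fraction(1, 2)],
]

if __name__ == "__main__":
    print(stationary_distribution(P3))
    print(stationary_distribution([[0, 1], [1, 0]]))
```

Exact rational arithmetic is used so that the answer can be compared directly with a hand computation. For the periodic two-state chain of the remark the solver returns (1/2, 1/2), which is stationary even though no steady-state distribution exists.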


PROBLEMS 


1. Show that if there is a single positive recurrent class C, then 
$\{1/\mu_j, j \in C\}$, with probability 0 assigned to states outside C, gives the 
unique stationary distribution for the chain. HINT: 
$p_{ij}^{(n)} = \sum_{k \in C} p_{ik}^{(n-1)} p_{kj}$, $i \in C$. Use Fatou's lemma 
to show that $1/\mu_j \ge \sum_{k \in C} (1/\mu_k) p_{kj}$. Then use the fact that 
$\sum_{j \in C} 1/\mu_j = 1$. 

2. (a) If, for some N, $\Pi^N$ has a column bounded away from 0, that is, if for 
some $j_0$ and some $\delta > 0$ we have $p_{ij_0}^{(N)} \ge \delta > 0$ for all i, 
show that there is exactly one recurrent class (namely, the class of $j_0$); this 
class is positive and aperiodic. 

(b) In the case of a finite chain, show that a steady-state distribution exists iff 
$\Pi^N$ has a positive column for some N. 

3. Classify the states of the following Markov chains. Discuss the limiting behavior 
of the transition probabilities and the existence of steady-state and stationary 
distributions. 

(a) Simple random walk with no barriers. 

(b) Simple random walk with absorbing barrier at 0. 

(c) Simple random walk with absorbing barriers at 0 and 5. 

(d) Simple random walk with reflecting barrier at 0. 

(e) Simple random walk with reflecting barriers at 0 and $l + 1$. 

(f) The chain of Example 2, Section 7.1. 

(g) The chain of Problem 4c, Section 7.3. 

(h) The chain of Problem 4d, Section 7.3. 

(i) A sequence of independent random variables (Problem 5, Section 7.3). 

(j) The chain with transition matrix 


Introduction to Statistics 


8.1 STATISTICAL DECISIONS 


Suppose that the number of telephone calls made per day at a given exchange 
is known to have a Poisson distribution with parameter $\theta$, but $\theta$ itself 
is unknown. In order to obtain some information about $\theta$, we observe the 
number of calls over a certain period of time, and then try to come to a 
decision about $\theta$. The nature of the decision will depend on the type of 
information desired. For example, it may be that extra equipment will be 
needed if $\theta > \theta_0$, but not if $\theta \le \theta_0$. In this case we make one 
of two possible decisions: we decide either that $\theta > \theta_0$ or that 
$\theta \le \theta_0$. Alternatively, we may want to estimate the actual value of 
$\theta$ in order to know how much equipment to install. In this case the decision 
results in a number $\hat{\theta}$, which we hope is as close to $\theta$ as possible. 
In general, an incorrect decision will result in a loss, which may be measurable 
in precise terms, as in the case of the cost of unnecessary equipment, but which 
also may have intangible components. For example, it may be difficult to assign 
a numerical value to losses due to customer complaints, unfavorable publicity, 
or government investigations. 

Decision problems such as the one just discussed may be formulated 
mathematically by means of a statistical decision model. The ingredients of the 
model are as follows. 

1. N, the set of states of nature. 

2. A random variable (or random vector) R, the observable, whose 
distribution function $F_\theta$ depends on the particular $\theta \in N$. We may 
imagine that "nature" chooses the parameter $\theta \in N$ (without revealing 
the result to us); we then observe the value of a random variable R with 
distribution function $F_\theta$. In the above example, N is the set of positive 
real numbers, and $F_\theta$ is the distribution function of a Poisson random 
variable with parameter $\theta$. 

3. A, the set of possible actions. In the above example, since we are trying to 
determine the value of $\theta$, $A = N = (0, \infty)$. 

4. A loss function (or cost function) $L(\theta, a)$, $\theta \in N$, $a \in A$; 
$L(\theta, a)$ represents our loss when the true state of nature is $\theta$ and we 
take action a. 

The process by which we arrive at a decision may be described by means of 
a decision function, defined as follows. 

Let E be the range of the observable R (e.g., $E^1$ if R is a random variable, 
$E^n$ if R is an n-dimensional random vector). A nonrandomized decision 
function is a function $\varphi$ from E to A. Thus, if R takes the value x, we take 
action $\varphi(x)$. $\varphi$ is to be chosen so as to minimize the loss, in some sense. 

Nonrandomized decision functions are not adequate to describe all aspects 
of the decision-making process. For example, under certain conditions we 
may flip a coin or use some other chance device to determine the appropriate 
action. (If you are a statistician employed by a company, it is best to do this 
out of sight of the customer.) The general concept of a decision function is 
that of a mapping assigning to each $x \in E$ a probability measure $P_x$ on an 
appropriate sigma field of subsets of A. Thus $P_x(B)$ is the probability of taking 
an action in the set B when R = x is observed. A nonrandomized decision 
function may be regarded as a decision function with each $P_x$ concentrated 
on a single point; that is, for each x we have $P_x\{a\} = 1$ for some a 
$(= \varphi(x))$ in A. 

We shall concentrate on the two most important special cases of the 
statistical decision problem, hypothesis testing and estimation. 

A typical physical situation in which decisions of this type occur is the 
problem of signal detection. The input to a radar receiver at a particular 
instant of time may be regarded as a random variable R with density $f_\theta$, 
where $\theta$ is related to the signal strength. In the simplest model, 
$R = \theta + R'$, where $R'$ (the noise) is a random variable with a specified 
density, and $\theta$ is a fixed but unknown constant determined by the strength 
of the signal. We may be interested in distinguishing between two conditions: 
the absence of a target ($\theta = \theta_0$) versus its presence 
($\theta = \theta_1$); this is an example of a hypothesis-testing problem. 
Alternatively, we may know that a signal is present and wish to estimate its 
strength. Thus, after observing R, we record a number that we hope is close to 
the true value of $\theta$; this is an example of a problem in estimation. 

As another example, suppose that $\theta$ is the (unknown) percentage of 
defective components produced on an assembly line. We inspect n components 
(i.e., we observe $R_1, \ldots, R_n$, where $R_i = 1$ if component i is defective, 
$R_i = 0$ if component i is acceptable) and then try to say something about 
$\theta$. We may be trying to distinguish between the two conditions 
$\theta \le \theta_0$ and $\theta > \theta_0$ (hypothesis testing), or we may be 
trying to come as close as possible to the true value of $\theta$ (estimation). 

In working the specific examples in the chapter, the table of common 
density and probability functions and their properties given at the end of the 
book may be helpful. 


8.2 HYPOTHESIS TESTING 


Consider again the statistical decision model of the preceding section. Suppose 
that $H_0$ and $H_1$ are disjoint nonempty subsets of N whose union is N, 
and our objective is to determine whether the true state of nature $\theta$ belongs 
to $H_0$ or to $H_1$. (In the example on the telephone exchange, $H_0$ might 
correspond to $\theta \le \theta_0$, and $H_1$ to $\theta > \theta_0$.) Thus our ultimate 
decision must be either "$\theta \in H_0$" or "$\theta \in H_1$," so that the action 
space A contains only two points, labeled 0 and 1 for convenience. 

The above decision problem is called a hypothesis-testing problem; $H_0$ 
is called the null hypothesis, and $H_1$ the alternative. $H_0$ is said to be simple 
iff it contains only one element; otherwise $H_0$ is said to be composite, and 
similarly for $H_1$. To take action 1 is to reject the null hypothesis $H_0$; to take 
action 0 is to accept $H_0$. 

We first consider the case of simple hypothesis versus simple alternative. 
Here $H_0$ and $H_1$ each contain one element, say $\theta_0$ and $\theta_1$. For the 
sake of definiteness, we assume that under $H_0$, R is absolutely continuous with 
density $f_0$, and under $H_1$, R is absolutely continuous with density $f_1$. (The 
results of this section will also apply to the discrete case upon replacing 
integrals by sums.) Thus the problem essentially comes down to deciding, 
after observing R, whether R has density $f_0$ or $f_1$. 

A decision function may be specified by giving a (Borel measurable) 
function $\varphi$ from E to [0, 1], with $\varphi(x)$ interpreted as the probability of 
rejecting $H_0$ when x is observed. Thus, if $\varphi(x) = 1$, we reject $H_0$; if 
$\varphi(x) = 0$, we accept $H_0$; and if $\varphi(x) = a$, $0 < a < 1$, we toss a coin 
with probability a of heads: if the coin comes up heads, we reject $H_0$; if tails, 
we accept $H_0$. The set $\{x: \varphi(x) = 1\}$ is called the rejection region or the 
critical region; the function $\varphi$ is called a test. The decision we arrive at may 
be in error in two possible ways. A type 1 error occurs if we reject $H_0$ when it is 
in fact true, and a type 2 error occurs if $H_0$ is accepted when it is false, that is, 
when $H_1$ is true. Now if $H_0$ is true and we observe R = x, an error will be made 
if $H_0$ is rejected; this happens with probability $\varphi(x)$. Thus the probability of 
a type 1 error is 

\[ \alpha = \int_{-\infty}^{\infty} \varphi(x) f_0(x)\,dx \tag{8.2.1} \]

Similarly, the probability of a type 2 error is 

\[ \beta = \int_{-\infty}^{\infty} (1 - \varphi(x)) f_1(x)\,dx \tag{8.2.2} \]

Note that $\alpha$ is the expectation of $\varphi(R)$ under $H_0$, sometimes written 
$E_{\theta_0}\varphi$; similarly, $\beta = 1 - E_{\theta_1}\varphi$. 

It would be desirable to choose $\varphi$ so that both $\alpha$ and $\beta$ will be 
small, but, as we shall see, a decrease in one of the two error probabilities usually 
results in an increase in the other. For example, if we ignore the observed 
data and always accept $H_0$, then $\alpha = 0$ but $\beta = 1$. 

There is no unique answer to the question of what is a good test; we shall 
consider several possibilities. First, suppose that there is a nonnegative cost 
$c_i$ associated with a type i error, i = 1, 2. (For simplicity, assume that the 
cost of a correct decision is 0.) Suppose also that we know the probability p 
that the null hypothesis will be true. (p is called the a priori probability of $H_0$. 
In many situations it will be difficult to estimate; for example, in a radar 
reception problem, $H_0$ might correspond to no signal being present.) 

Let $\varphi$ be a test with error probabilities $\alpha(\varphi)$ and $\beta(\varphi)$. The 
over-all average cost associated with $\varphi$ is 

\[ B(\varphi) = p c_1 \alpha(\varphi) + (1 - p) c_2 \beta(\varphi) \tag{8.2.3} \]

$B(\varphi)$ is called the Bayes risk associated with $\varphi$; a test that minimizes 
$B(\varphi)$ is called a Bayes test corresponding to the given p, $c_1$, $c_2$, $f_0$, 
and $f_1$. 

The Bayes solution can be computed in a straightforward way. We have, 
from (8.2.1)-(8.2.3), 

\[ B(\varphi) = \int_{-\infty}^{\infty} [p c_1 \varphi(x) f_0(x) + (1 - p) c_2 (1 - \varphi(x)) f_1(x)]\,dx \]

\[ = \int_{-\infty}^{\infty} \varphi(x) [p c_1 f_0(x) - (1 - p) c_2 f_1(x)]\,dx + (1 - p) c_2 \tag{8.2.4} \]

Now if we wish to minimize $\int_S \varphi(x) g(x)\,dx$ and $g(x) < 0$ on S, we can do no 
better than to take $\varphi(x) = 1$ for all x in S; if $g(x) > 0$ on S, we should take 
$\varphi(x) = 0$ for all x in S; if $g(x) = 0$ on S, $\varphi(x)$ may be chosen arbitrarily. 
In this case $g(x) = p c_1 f_0(x) - (1 - p) c_2 f_1(x)$, and the Bayes solution may 
therefore be given as follows. 


Let $L(x) = f_1(x)/f_0(x)$. 

If $L(x) > p c_1 / (1 - p) c_2$, take $\varphi(x) = 1$; that is, reject $H_0$. 
If $L(x) < p c_1 / (1 - p) c_2$, take $\varphi(x) = 0$; that is, accept $H_0$. 
If $L(x) = p c_1 / (1 - p) c_2$, take $\varphi(x)$ = anything. 


L is called the likelihood ratio, and a test $\varphi$ such that for some constant 
$\lambda$, $0 \le \lambda \le \infty$, $\varphi(x) = 1$ when $L(x) > \lambda$ and $\varphi(x) = 0$ 
when $L(x) < \lambda$, is called a likelihood ratio test, abbreviated LRT. 

To avoid ambiguity, if $f_1(x) > 0$ and $f_0(x) = 0$, we take $L(x) = \infty$. The 
set on which $f_1(x) = f_0(x) = 0$ may be ignored, since it will have probability 
0 under both $H_0$ and $H_1$. Also, if we observe an x for which $f_1(x) > 0$ and 
$f_0(x) = 0$, it must be associated with $H_1$, so that we should take 
$\varphi(x) = 1$. It will be convenient to build this requirement into the definition 
of a likelihood ratio test: if $L(x) = \infty$ we assume that $\varphi(x) = 1$. 
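
The Bayes rule and the conventions for an infinite likelihood ratio can be summarized in a few lines of Python. This is only a sketch: the function names (`likelihood_ratio`, `bayes_threshold`, `lrt`) and the `tie` argument, which stands for the arbitrary value of the test on the boundary, are illustrative choices, and the two densities at the bottom are hypothetical inputs used to exercise the code.

```python
import math


def likelihood_ratio(f0, f1, x):
    """L(x) = f1(x)/f0(x), with L(x) = infinity when f0(x) = 0 < f1(x)."""
    num, den = f1(x), f0(x)
    if den == 0:
        if num == 0:
            # Such points have probability 0 under both hypotheses.
            raise ValueError("both densities vanish at x")
        return math.inf
    return num / den


def bayes_threshold(p, c1, c2):
    """Threshold p*c1 / ((1 - p)*c2); infinite if the denominator is 0."""
    den = (1 - p) * c2
    return math.inf if den == 0 else p * c1 / den


def lrt(f0, f1, lam, x, tie=0.0):
    """Probability of rejecting H0 at x for the LRT with parameter lam."""
    ratio = likelihood_ratio(f0, f1, x)
    if ratio == math.inf or ratio > lam:
        return 1.0          # reject H0
    if ratio < lam:
        return 0.0          # accept H0
    return tie              # boundary: any value in [0, 1] is allowed


# Hypothetical densities on (0, 1), used only for illustration.
def f0(x):
    return 1.0 if 0 < x < 1 else 0.0


def f1(x):
    return 3 * x * x if 0 < x < 1 else 0.0


if __name__ == "__main__":
    lam = bayes_threshold(p=0.75, c1=1.5, c2=3.0)
    print(lam, lrt(f0, f1, lam, 0.8), lrt(f0, f1, lam, 0.5))
```

Passing the densities as functions keeps the rule independent of any particular pair of hypotheses; only the ratio and the threshold enter the decision.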

In fact, likelihood ratio tests are completely adequate to describe the 
problem of testing a simple hypothesis versus a simple alternative. This 
assertion will be justified by the sequence of theorems to follow. 

From now on, the notation $P_\theta(B)$ will indicate the probability that the 
value of R will belong to the set B when the true state of nature is $\theta$. 


Theorem 1. For any $\alpha$, $0 \le \alpha \le 1$, there is a likelihood ratio test whose 
probability of type 1 error is $\alpha$. 


PROOF. If $\alpha = 0$, the test given by $\varphi(x) = 1$ if $L(x) = \infty$; 
$\varphi(x) = 0$ if $L(x) < \infty$, is the desired LRT, so assume $\alpha > 0$. Now 
$G(y) = P_{\theta_0}\{x: L(x) \le y\}$, $-\infty < y < \infty$, is a distribution function 
[of the random variable L(R); notice that $L(R) \ge 0$, and L(R) cannot be infinite 
under $H_0$]. Thus either we can find $\lambda$, $0 \le \lambda < \infty$, such that 
$G(\lambda) = 1 - \alpha$, or else G jumps through $1 - \alpha$; that is, for some 
$\lambda$ we have $G(\lambda^-) \le 1 - \alpha < G(\lambda)$ (see Figure 8.2.1). Define 

\[ \varphi(x) = 1 \quad \text{if } L(x) > \lambda \]
\[ \varphi(x) = 0 \quad \text{if } L(x) < \lambda \]
\[ \varphi(x) = a \quad \text{if } L(x) = \lambda \]

where $a = [G(\lambda) - (1 - \alpha)]/[G(\lambda) - G(\lambda^-)]$ if 
$G(\lambda) > G(\lambda^-)$, a = an arbitrary number in [0, 1] if 
$G(\lambda) = G(\lambda^-)$. Then the probability of a type 1 error is 

\[ P_{\theta_0}\{x: L(x) > \lambda\} + a P_{\theta_0}\{x: L(x) = \lambda\} = 1 - G(\lambda) + a[G(\lambda) - G(\lambda^-)] = \alpha \]

as desired. 

FIGURE 8.2.1 


A test is said to be at level $\alpha_0$ if its probability $\alpha$ of type 1 error is 
$\le \alpha_0$. $\alpha$ itself is called the size of the test, and $1 - \beta$, the 
probability of rejecting the null hypothesis when it is false, is called the power 
of the test. 

The following result, known as the Neyman-Pearson lemma, is the funda- 
mental theorem of hypothesis testing. 


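Theorem 2. Let $\varphi_\lambda$ be a LRT with parameter $\lambda$ and error 
probabilities $\alpha_\lambda$ and $\beta_\lambda$. Let $\varphi$ be an arbitrary test with 
error probabilities $\alpha$ and $\beta$; if $\alpha \le \alpha_\lambda$, then 
$\beta \ge \beta_\lambda$. In other words, the LRT has maximum power among all tests at 
level $\alpha_\lambda$. 


We give two proofs. 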


FIRST PROOF. Consider the Bayes problem with costs $c_1 = c_2 = 1$, and 
set $\lambda = p c_1/(1 - p) c_2 = p/(1 - p)$. Assuming first that $\lambda < \infty$, 
we have $p = \lambda/(1 + \lambda)$. Thus $\varphi_\lambda$ is the Bayes solution when the 
a priori probability is $p = \lambda/(1 + \lambda)$. 

If $\beta < \beta_\lambda$, we compute the Bayes risk [see (8.2.3)] for 
$p = \lambda/(1 + \lambda)$, using the test $\varphi$. 

\[ B(\varphi) = p\alpha + (1 - p)\beta \]

But $\alpha \le \alpha_\lambda$ by hypothesis, while $\beta < \beta_\lambda$ and $p < 1$ by 
assumption. Thus $B(\varphi) < B(\varphi_\lambda)$, contradicting the fact that 
$\varphi_\lambda$ is the Bayes solution. 

It remains to consider the case $\lambda = \infty$. Then we must have 
$\varphi_\lambda(x) = 1$ if $L(x) = \infty$, $\varphi_\lambda(x) = 0$ if $L(x) < \infty$. 
Then $\alpha_\lambda = 0$, since L(R) is never infinite under $H_0$; consequently 
$\alpha = 0$, so that, by (8.2.1), $\varphi(x) f_0(x) \equiv 0$ [strictly speaking, 
$\varphi(x) f_0(x) = 0$ except possibly on a set of Lebesgue measure 0]. 
By (8.2.2), 

\[ \beta = \int_{\{x: L(x) < \infty\}} (1 - \varphi(x)) f_1(x)\,dx + \int_{\{x: L(x) = \infty\}} (1 - \varphi(x)) f_1(x)\,dx \]

If $L(x) < \infty$, then $f_0(x) > 0$; hence $\varphi(x) = 0$. Thus, in order to 
minimize $\beta$, we must take $\varphi(x) = 1$ when $L(x) = \infty$. But this says 
that $\beta \ge \beta_\lambda$, completing the proof. 


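SECOND PROOF. First assume $\lambda < \infty$. We claim that 
$[\varphi_\lambda(x) - \varphi(x)][f_1(x) - \lambda f_0(x)] \ge 0$ for all x. For if 
$f_1(x) > \lambda f_0(x)$, then $\varphi_\lambda(x) = 1 \ge \varphi(x)$, and if 
$f_1(x) < \lambda f_0(x)$, then $\varphi_\lambda(x) = 0 \le \varphi(x)$. Thus 

\[ \int_{-\infty}^{\infty} [\varphi_\lambda(x) - \varphi(x)][f_1(x) - \lambda f_0(x)]\,dx \ge 0 \]

By (8.2.1) and (8.2.2), 

\[ 1 - \beta_\lambda - (1 - \beta) - \lambda\alpha_\lambda + \lambda\alpha \ge 0 \]

or 

\[ \beta - \beta_\lambda \ge \lambda(\alpha_\lambda - \alpha) \ge 0 \]

The case $\lambda = \infty$ is handled just as in the first proof. 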


If we wish to construct a test that is best at level $\alpha$ in the sense of maximum 
power, we find, by Theorem 1, a LRT of size $\alpha$. By Theorem 2, the test has 
maximum power among all tests at level $\alpha$. We shall illustrate the procedure 
with examples and problems later in the section. 

Finally, we show that no matter what criterion the statistician adopts in 
defining a good test, he can restrict himself to the class of likelihood ratio 
tests. 

A test $\varphi$ with error probabilities $\alpha$ and $\beta$ is said to be inadmissible 
iff there is a test $\varphi'$ with error probabilities $\alpha'$ and $\beta'$, with 
$\alpha' \le \alpha$, $\beta' \le \beta$, and either $\alpha' < \alpha$ or $\beta' < \beta$. 
(In this case we say that $\varphi'$ is better than $\varphi$.) Of course, $\varphi$ 
is admissible iff it is not inadmissible. 


Theorem 3. Every LRT is admissible. 


PROOF. Let $\varphi_\lambda$ be a LRT with parameter $\lambda$ and error probabilities 
$\alpha_\lambda$ and $\beta_\lambda$, and $\varphi$ an arbitrary test with error 
probabilities $\alpha$ and $\beta$. We have seen that if $\alpha \le \alpha_\lambda$, 
then $\beta \ge \beta_\lambda$. But the Neyman-Pearson lemma is symmetric in $H_0$ 
and $H_1$. In other words, if we relabel $H_1$ as the null hypothesis and $H_0$ as the 
alternative, Theorem 2 states that if $\beta \le \beta_\lambda$, then 
$\alpha \ge \alpha_\lambda$; the result follows. 


Thus no test can be better than a LRT. In fact, if $\varphi$ is any test, then there is 
a LRT $\varphi_\lambda$ that is as good as $\varphi$; that is, $\alpha_\lambda \le \alpha$ and 
$\beta_\lambda \le \beta$. For by Theorem 1 there is a LRT $\varphi_\lambda$ with 
$\alpha_\lambda = \alpha$, and by Theorem 2, $\beta_\lambda \le \beta$. This argument 
establishes the following result, essentially a converse to Theorem 3. 

Theorem 4. If $\varphi$ is an admissible test, there is a LRT with exactly the same 
error probabilities. 


PROOF. As above, we find a LRT $\varphi_\lambda$ with $\alpha_\lambda = \alpha$ and 
$\beta_\lambda \le \beta$; since $\varphi$ is admissible, we must have 
$\beta_\lambda = \beta$. 


▶ Example 1. Suppose that under $H_0$, R is uniformly distributed between 
0 and 1, and under $H_1$, R has density $3x^2$, $0 \le x \le 1$. For short we write 

\[ H_0: f_0(x) = 1, \quad 0 \le x \le 1 \]
\[ H_1: f_1(x) = 3x^2, \quad 0 \le x \le 1 \]

We are going to find the risk set S, that is, the set of points 
$(\alpha(\varphi), \beta(\varphi))$ where $\varphi$ ranges over all possible tests. [The 
individual points $(\alpha(\varphi), \beta(\varphi))$ are called risk points.] We are also 
going to find the set $S_A$ of admissible risk points, that is, the set of risk points 
corresponding to admissible tests. By Theorems 3 and 4, $S_A$ is the set of risk 
points corresponding to LRTs. 

First we notice two general properties of S. 

1. S is convex; that is, if $Q_1$ and $Q_2$ belong to S, so do all points on the 
line segment joining $Q_1$ to $Q_2$. In other words, $(1 - a)Q_1 + aQ_2 \in S$ for all 
$a \in [0, 1]$. 

For if $Q_1 = (\alpha(\varphi_1), \beta(\varphi_1))$, $Q_2 = (\alpha(\varphi_2), \beta(\varphi_2))$ 
and $0 \le a \le 1$, let $\varphi = (1 - a)\varphi_1 + a\varphi_2$. Then $\varphi$ is a test, 
and by (8.2.1) and (8.2.2), $\alpha(\varphi) = (1 - a)\alpha(\varphi_1) + a\alpha(\varphi_2)$, 
$\beta(\varphi) = (1 - a)\beta(\varphi_1) + a\beta(\varphi_2)$. If 
$Q = (\alpha(\varphi), \beta(\varphi))$, then $Q \in S$, since $\varphi$ is a test, and 
$Q = (1 - a)Q_1 + aQ_2$. 

2. S is symmetric about (1/2, 1/2); that is, if $|\varepsilon|, |\delta| \le 1/2$ and 
$(1/2 - \varepsilon, 1/2 - \delta) \in S$, then $(1/2 + \varepsilon, 1/2 + \delta) \in S$. 
Equivalently, $(\alpha, \beta) \in S$ implies $(1 - \alpha, 1 - \beta) \in S$. 

For if $(\alpha(\varphi), \beta(\varphi)) \in S$, let $\varphi' = 1 - \varphi$; then $\varphi'$ 
is a test and $\alpha(\varphi') = 1 - \alpha(\varphi)$, $\beta(\varphi') = 1 - \beta(\varphi)$. 

To return to the present example, we have $L(x) = 3x^2$, $0 \le x \le 1$. Thus 
the error probabilities for a LRT with parameter $\lambda \le 3$ are 

\[ \alpha = P_{\theta_0}\{x: L(x) > \lambda\} = P_{\theta_0}\{x: x > (\lambda/3)^{1/2}\} = 1 - (\lambda/3)^{1/2} \]

\[ \beta = P_{\theta_1}\{x: L(x) < \lambda\} = P_{\theta_1}\{x: x < (\lambda/3)^{1/2}\} = \int_0^{(\lambda/3)^{1/2}} 3x^2\,dx = (\lambda/3)^{3/2} = (1 - \alpha)^3 \]

(If $\lambda > 3$, then $\alpha = 0$, $\beta = 1$.) Thus 
$S_A = \{(\alpha, (1 - \alpha)^3): 0 \le \alpha \le 1\}$. 
Since no test can be better than a LRT, $S_A$ is the lower boundary of the 
set S; hence, by symmetry, $\{(1 - \alpha, 1 - (1 - \alpha)^3): 0 \le \alpha \le 1\} = 
\{(\alpha, 1 - \alpha^3): 0 \le \alpha \le 1\}$ is the upper boundary of S. Thus S must be 
$\{(\alpha, \beta): 0 \le \alpha \le 1, (1 - \alpha)^3 \le \beta \le 1 - \alpha^3\}$ (see 
Figure 8.2.2). 

Various tests may now be computed without difficulty. We give some 
typical illustrations. 

(a) Find a most powerful test at level .15. Set $\alpha = .15 = 1 - (\lambda/3)^{1/2}$. 
Since $L(x) > \lambda$ iff $x > (\lambda/3)^{1/2}$, the test is given by 

\[ \varphi(x) = 1 \quad \text{if } x > .85 \]
\[ \varphi(x) = 0 \quad \text{if } x < .85 \]
\[ \varphi(x) = \text{anything} \quad \text{if } x = .85 \]

We have $\beta = (1 - \alpha)^3 = (.85)^3 = .614$. 

(b) Find a Bayes test corresponding to $c_1 = 3/2$, $c_2 = 3$, $p = 3/4$. This 
is a LRT with $\lambda = p c_1/(1 - p) c_2 = 3/2$; that is, 

\[ \varphi(x) = 1 \quad \text{if } x > (\lambda/3)^{1/2} = \sqrt{2}/2 = .707 \]
\[ \varphi(x) = 0 \quad \text{if } x < \sqrt{2}/2 \]
\[ \varphi(x) = \text{anything} \quad \text{if } x = \sqrt{2}/2 \]

Thus $\alpha = 1 - (\lambda/3)^{1/2} = .293$, $\beta = (1 - \alpha)^3$, and the Bayes risk 
may be computed using (8.2.3). 

FIGURE 8.2.3 Geometric Interpretation of Bayes Solution. 


The Bayes solution may be interpreted geometrically as follows. We 
are trying to find a test that minimizes the Bayes risk 
$p c_1 \alpha + (1 - p) c_2 \beta = (9/8)\alpha + (3/4)\beta$. If we vary c until the line 
$(9/8)\alpha + (3/4)\beta = c$ intersects $S_A$, we find the desired test (see 
Figure 8.2.3). 

Notice also that to find the Bayes solution we may differentiate 
$(9/8)\alpha + (3/4)(1 - \alpha)^3$ and set the result equal to zero to obtain 
$\alpha = 1 - \sqrt{2}/2$, as before. 

(c) Find a minimax test, that is, a test that minimizes max $(\alpha, \beta)$. It is 
immediate from the definition of admissibility that an admissible test with 
constant risk (i.e., $\alpha = \beta$) is minimax. Thus we set 
$\alpha = \beta = (1 - \alpha)^3$, which yields $\alpha = .318$ (approximately). Therefore 
$(\lambda/3)^{1/2} = 1 - \alpha = .682$, and so we reject $H_0$ if $x > .682$ and accept 
$H_0$ if $x < .682$. ◀ 
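
The three calculations of Example 1 can be reproduced numerically. The Python sketch below encodes the error probabilities of the LRT with parameter lambda as derived in the example; the function names and the bisection tolerance are illustrative choices and are not part of the text.

```python
import math


def alpha(lam):
    """Type 1 error of the LRT: 1 - (lam/3)^(1/2) for lam <= 3, else 0."""
    return 1 - math.sqrt(lam / 3) if lam <= 3 else 0.0


def beta(lam):
    """Type 2 error of the LRT: (lam/3)^(3/2) for lam <= 3, else 1."""
    return (lam / 3) ** 1.5 if lam <= 3 else 1.0


def bayes_risk(lam, p, c1, c2):
    """Bayes risk p*c1*alpha + (1 - p)*c2*beta, as in (8.2.3)."""
    return p * c1 * alpha(lam) + (1 - p) * c2 * beta(lam)


def minimax_alpha(tol=1e-12):
    """Solve alpha = (1 - alpha)^3 on [0, 1] by bisection.

    h(a) = a - (1 - a)^3 is increasing with h(0) < 0 < h(1),
    so the root is unique.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid - (1 - mid) ** 3 < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    lam_a = 3 * 0.85 ** 2                   # part (a): cutoff x = .85
    print(alpha(lam_a), beta(lam_a))
    print(alpha(1.5), beta(1.5), bayes_risk(1.5, 0.75, 1.5, 3.0))  # part (b)
    a = minimax_alpha()                     # part (c)
    print(a, 1 - a)
```

Bisection is used for part (c) only because it needs no derivative; any root finder would do for the cubic equation.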


▶ Example 2. Let R be a discrete random variable taking on only the 
values 0, 1, 2, 3. Let the probability function of R under $H_i$ be $p_i$, i = 0, 1, 
where the $p_i$ are as follows. 

    x          0     1     2     3 
    p_0(x)    .1    .2    .3    .4 
    p_1(x)    .2    .1    .4    .3 

The appropriate likelihood ratio here is $L(x) = p_1(x)/p_0(x)$. Arranging the 
values of L(x) in increasing order, we have the following table. 

    x          1     3     2     0 
    L(x)      1/2   3/4   4/3    2 

We may therefore describe the LRT with parameter $\lambda$ as follows. 

    LRT                          Rejection Region   Acceptance Region   $\alpha$   $\beta$ 
    $0 \le \lambda < 1/2$        All x              Empty               1          0 
    $1/2 < \lambda < 3/4$        x = 0, 2, 3        x = 1               .8         .1 
    $3/4 < \lambda < 4/3$        x = 0, 2           x = 1, 3            .4         .4 
    $4/3 < \lambda < 2$          x = 0              x = 1, 2, 3         .1         .8 
    $2 < \lambda \le \infty$     Empty              All x               0          1 

Now assume $\lambda = 3/4$. Then we reject $H_0$ if x = 0 or 2, accept $H_0$ if 
x = 1, and if x = 3 we randomize, that is, reject $H_0$ with probability a, 
$0 \le a \le 1$. Thus 

\[ \alpha = p_0(0) + p_0(2) + a p_0(3) = .4 + .4a \]
\[ \beta = p_1(1) + (1 - a) p_1(3) = .1 + .3(1 - a) \]

As a ranges over [0, 1], $(\alpha, \beta)$ traces out the line segment joining (.4, .4) 
to (.8, .1). In a similar fashion we calculate the error probabilities for 
$\lambda = 1/2$, 4/3, and 2. The admissible risk points are shown in Figure 8.2.4. 

We compute several tests. 

(a) Find a most powerful test at level .25. Since .1 < .25 < .4, we have 
$\lambda = 4/3$. Thus we reject $H_0$ if x = 0, accept $H_0$ if x = 1 or 3, and reject 
$H_0$ with probability a if x = 2, where .1(1 - a) + .4a = .25, so that a = 1/2. 
Notice that $\beta = .8(1 - a) + .4a = .6$. 

(b) Find a Bayes test with $c_1 = c_2 = 1$, p = .6. We have 
$\lambda = p c_1/(1 - p) c_2 = 3/2$. Thus we reject $H_0$ if x = 0 and accept $H_0$ 
otherwise. The error probabilities are $\alpha = .1$, $\beta = .8$, and the Bayes risk is 
$p c_1 \alpha + (1 - p) c_2 \beta = .38$. 

(c) Find a minimax test. The only admissible test with $\alpha = \beta$ has 
$\alpha = \beta = .4$, so that $3/4 < \lambda < 4/3$. We reject $H_0$ when x = 0 or 2 and 
accept $H_0$ if x = 1 or 3. ◀ 
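
The construction in the proof of Theorem 1 is easy to carry out for a discrete observable. The Python sketch below does so for the probability functions of Example 2, whose values are taken here to be p_0 = (.1, .2, .3, .4) and p_1 = (.2, .1, .4, .3) on x = 0, 1, 2, 3 (values consistent with the error probabilities computed in the example). The function names are illustrative, and the code assumes p_0(x) > 0 for every x, so that the likelihood ratio is always finite.

```python
from fractions import Fraction as F

# Probability functions of Example 2 (assumed values, see the lead-in).
P0 = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
P1 = {0: F(2, 10), 1: F(1, 10), 2: F(4, 10), 3: F(3, 10)}


def lrt_of_size(p0, p1, alpha):
    """Return (lam, a) for a LRT of size alpha.

    The test rejects H0 when L(x) > lam and rejects with probability a
    when L(x) = lam. Assumes p0(x) > 0 for all x and 0 <= alpha <= 1.
    """
    ratio = {x: p1[x] / p0[x] for x in p0}
    # Scan the possible thresholds from the largest downward.
    for lam in sorted(set(ratio.values()), reverse=True):
        above = sum(p0[x] for x in p0 if ratio[x] > lam)   # P0{L > lam}
        at = sum(p0[x] for x in p0 if ratio[x] == lam)     # P0{L = lam}
        if above + at >= alpha:
            return lam, (alpha - above) / at
    raise ValueError("alpha must lie in [0, 1]")


def error_probabilities(p0, p1, lam, a):
    """Return (alpha, beta) for the LRT with parameter lam and tie value a."""
    ratio = {x: p1[x] / p0[x] for x in p0}
    alpha = (sum(p0[x] for x in p0 if ratio[x] > lam)
             + a * sum(p0[x] for x in p0 if ratio[x] == lam))
    beta = (sum(p1[x] for x in p0 if ratio[x] < lam)
            + (1 - a) * sum(p1[x] for x in p0 if ratio[x] == lam))
    return alpha, beta


if __name__ == "__main__":
    lam, a = lrt_of_size(P0, P1, F(1, 4))           # part (a)
    print(lam, a, error_probabilities(P0, P1, lam, a))
    al, be = error_probabilities(P0, P1, F(3, 2), F(0))   # part (b)
    print(al, be, F(6, 10) * al + F(4, 10) * be)
```

Exact fractions avoid rounding problems when testing whether L(x) equals the threshold, which matters because the randomization applies only on that boundary.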


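FIGURE 8.2.4 Admissible Risk Points When R is Discrete. 

▶ Example 3. Let R be normally distributed with mean $\theta$ and variance 
$\sigma^2$, where $\sigma^2$ is known. We wish to test the null hypothesis that 
$\theta = \theta_0$ against the alternative that $\theta = \theta_1$, and the test is to be 
based on n independent observations $R_1, \ldots, R_n$ of R. (Assume 
$\theta_0 < \theta_1$.) 

The appropriate likelihood ratio is 

\[ L(x) = \frac{f_1(x_1, \ldots, x_n)}{f_0(x_1, \ldots, x_n)} = \frac{(2\pi\sigma^2)^{-n/2} \exp\left[-\sum_{k=1}^{n} (x_k - \theta_1)^2 / 2\sigma^2\right]}{(2\pi\sigma^2)^{-n/2} \exp\left[-\sum_{k=1}^{n} (x_k - \theta_0)^2 / 2\sigma^2\right]} \]

The condition $L(x) > \lambda$ is equivalent to $\ln L(x) > \ln \lambda$; that is, 

\[ 2(\theta_1 - \theta_0) \sum_{k=1}^{n} x_k + n(\theta_0^2 - \theta_1^2) > 2\sigma^2 \ln \lambda \tag{8.2.5} \]

This is of the form $\sum_{k=1}^{n} x_k > c$. Thus a LRT must be of the form 

\[ \varphi(x_1, \ldots, x_n) = 1 \quad \text{if } \sum_{k=1}^{n} x_k > c \]
\[ \varphi(x_1, \ldots, x_n) = 0 \quad \text{if } \sum_{k=1}^{n} x_k < c \]
\[ \varphi(x_1, \ldots, x_n) = \text{anything} \quad \text{if } \sum_{k=1}^{n} x_k = c \]

Now $R_1 + \cdots + R_n$ is normal with mean $n\theta$ and variance $n\sigma^2$, so that 
the error probabilities are 

\[ \alpha = P_{\theta_0}\left\{(x_1, \ldots, x_n): \sum_{k=1}^{n} x_k > c\right\} = P_{\theta_0}\{R_1 + \cdots + R_n > c\} \]

\[ \beta = P_{\theta_1}\left\{(x_1, \ldots, x_n): \sum_{k=1}^{n} x_k < c\right\} = P_{\theta_1}\{R_1 + \cdots + R_n < c\} \]

FIGURE 8.2.5 Admissible Risk Points When R is Normal. 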


Thus we have parametric equations for $\alpha$ and $\beta$ with c as parameter, 
$-\infty < c < \infty$. The admissible risk points are sketched in Figure 8.2.5. 

Suppose that we want a LRT of size $\alpha$. If $N_\alpha$ is the number such that 
$1 - F^*(N_\alpha) = \alpha$, then $(c - n\theta_0)/\sqrt{n}\,\sigma = N_\alpha$, so that 
$c = n\theta_0 + \sqrt{n}\,\sigma N_\alpha$. 

We now apply the results to a problem in testing a simple hypothesis 
versus a composite alternative. Again let R be normal $(\theta, \sigma^2)$, and take 
$H_0: \theta = \theta_0$, $H_1: \theta > \theta_0$. 

If we choose any particular $\theta_1 > \theta_0$ and test $\theta = \theta_0$ against 
$\theta = \theta_1$, the test described above is most powerful at level $\alpha$. However, 
the test is completely specified by c, and c does not depend on $\theta_1$. Thus, for 
any $\theta_1 > \theta_0$, the test has the highest power of any test at level $\alpha$ of 
$\theta = \theta_0$ versus $\theta = \theta_1$. Such a test is called a uniformly most 
powerful (UMP) level $\alpha$ test of $\theta = \theta_0$ versus $\theta > \theta_0$. 

We expect intuitively that the larger the separation between $\theta_0$ and 
$\theta_1$, the better the performance of the test in distinguishing between the two 
possibilities. This may be verified by considering the power function Q, 
defined by 

\[ Q(\theta) = E_\theta \varphi \]
    = the probability of rejecting $H_0$ when the true state of nature is $\theta$ 
\[ = P_\theta\{R_1 + \cdots + R_n > c\} \]
\[ = 1 - F^*\left(\frac{c - n\theta}{\sqrt{n}\,\sigma}\right) \]

Thus $Q(\theta)$ increases with $\theta$. 
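
The cutoff c and the power function of the one-sided normal test can be evaluated with the standard library. The Python sketch below assumes that F* denotes the normal (0, 1) distribution function, so that `statistics.NormalDist` can supply both F* and its inverse; the function names `cutoff` and `power` and the sample parameters in the usage lines are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution; its cdf plays the role of F* (an assumption
# about the notation, stated in the lead-in).
STD = NormalDist()


def cutoff(theta0, sigma, n, alpha):
    """c = n*theta0 + sqrt(n)*sigma*N_alpha, where 1 - F*(N_alpha) = alpha."""
    n_alpha = STD.inv_cdf(1 - alpha)
    return n * theta0 + sqrt(n) * sigma * n_alpha


def power(theta, c, sigma, n):
    """Q(theta) = 1 - F*((c - n*theta) / (sqrt(n)*sigma))."""
    return 1 - STD.cdf((c - n * theta) / (sqrt(n) * sigma))


if __name__ == "__main__":
    # Hypothetical parameters: theta0 = 0, sigma = 1, n = 25, size .05.
    c = cutoff(0.0, 1.0, 25, 0.05)
    for theta in (0.0, 0.2, 0.5, 1.0):
        print(theta, power(theta, c, 1.0, 25))
```

At theta = theta_0 the power equals the size of the test, and the printed values grow with theta, in line with the monotonicity noted above.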
Now if $H_0: \theta = \theta_0$, $H_1: \theta = \theta_1$, where $\theta_1 < \theta_0$, the 
same technique as above shows that a size $\alpha$ LRT is of the form 

\[ \varphi(x_1, \ldots, x_n) = 1 \quad \text{if } \sum_{k=1}^{n} x_k < c \]
\[ \varphi(x_1, \ldots, x_n) = 0 \quad \text{if } \sum_{k=1}^{n} x_k > c \]
\[ \varphi(x_1, \ldots, x_n) = \text{anything} \quad \text{if } \sum_{k=1}^{n} x_k = c \]

where 

\[ c = n\theta_0 + \sqrt{n}\,\sigma N_{1-\alpha} \]

Again, the test is UMP at level $\alpha$ for $\theta = \theta_0$ versus $\theta < \theta_0$, 
with power function 

\[ Q'(\theta) = F^*\left(\frac{c - n\theta}{\sqrt{n}\,\sigma}\right) \]

which increases as $\theta$ decreases (see Figure 8.2.6). 

The above discussion suggests that there can be no UMP level $\alpha$ test of 
$\theta = \theta_0$ versus $\theta \ne \theta_0$. For any such test $\varphi$ must have power 
function $Q(\theta)$ for $\theta > \theta_0$, and $Q'(\theta)$ for $\theta < \theta_0$. But the 
power function of $\varphi$ is given by 

\[ E_\theta \varphi = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \varphi(x_1, \ldots, x_n) f_\theta(x_1, \ldots, x_n)\,dx_1 \cdots dx_n \]

where $f_\theta$ is the joint density of n independent normal random variables with 
mean $\theta$ and variance $\sigma^2$. It can be shown that this is differentiable for all 
$\theta$ (the derivative can be taken under the integral sign). But a function that is 
$Q(\theta)$ for $\theta > \theta_0$ and $Q'(\theta)$ for $\theta < \theta_0$ cannot be 
differentiable at $\theta_0$. 

FIGURE 8.2.6 Power Functions. 

In fact, the test $\varphi$ with power function $Q(\theta)$ is UMP at level $\alpha$ for the 
composite hypothesis $H_0: \theta \le \theta_0$ versus the composite alternative 
$H_1: \theta > \theta_0$. Let us explain what this means. 

$\varphi$ is said to be at level $\alpha$ for $H_0$ versus $H_1$ iff 
$E_\theta \varphi \le \alpha$ for all $\theta \le \theta_0$; $\varphi$ is UMP at level $\alpha$ 
if for any test $\varphi'$ at level $\alpha$ for $H_0$ versus $H_1$ we have 
$E_\theta \varphi' \le E_\theta \varphi$ for all $\theta > \theta_0$. 

In the present case $E_\theta \varphi = Q(\theta) \le \alpha$ for $\theta \le \theta_0$ by 
monotonicity of $Q(\theta)$, and $E_\theta \varphi' \le E_\theta \varphi$ for 
$\theta > \theta_0$, since $\varphi$ is UMP at level $\alpha$ for $\theta = \theta_0$ versus 
$\theta > \theta_0$. 

The underlying reason for the existence of uniformly most powerful tests 
is the following. If $\theta < \theta'$, the likelihood ratio $f_{\theta'}(x)/f_\theta(x)$ can 
be expressed as a nondecreasing function of t(x) [where, in this case, 
$t(x) = x_1 + \cdots + x_n$; see (8.2.5)]. Whenever this happens, the family of 
densities $f_\theta$ is said to have the monotone likelihood ratio (MLR) property. 

Suppose that the $f_\theta$ have the MLR property. Consider the following test 
of $\theta = \theta_0$ versus $\theta = \theta_1$, $\theta_1 > \theta_0$. 

\[ \varphi(x) = 1 \quad \text{if } t(x) > c \]
\[ \varphi(x) = 0 \quad \text{if } t(x) < c \]
\[ \varphi(x) = a \quad \text{if } t(x) = c \]

where $P_{\theta_0}\{x: t(x) > c\} + a P_{\theta_0}\{x: t(x) = c\} = \alpha$ (notice that c 
does not depend on $\theta_1$). Let $\lambda$ be the value of the likelihood ratio when 
$t(x) = c$; then $L(x) > \lambda$ implies $t(x) > c$; hence $\varphi(x) = 1$. Also 
$L(x) < \lambda$ implies $t(x) < c$, so that $\varphi(x) = 0$. Thus $\varphi$ is a LRT and 
hence is most powerful at level $\alpha$. 
We may make the following observations. 

1. $\varphi$ is UMP at level $\alpha$ for $\theta = \theta_0$ versus $\theta > \theta_0$. 

This is immediate from the Neyman-Pearson lemma and the fact that c 
does not depend on the particular $\theta > \theta_0$. 

2. If $\theta_1 < \theta_2$, $\varphi$ is the most powerful test at level 
$\alpha_1 = E_{\theta_1}\varphi$ for $\theta = \theta_1$ versus $\theta = \theta_2$. 

Since $\varphi$ is a LRT, the Neyman-Pearson lemma yields this result immediately. 

3. If $\theta_1 < \theta_2$, then $E_{\theta_1}\varphi \le E_{\theta_2}\varphi$; that is, $\varphi$ 
has a monotone nondecreasing power function. It follows, as in the earlier 
discussion, that $\varphi$ is UMP at level $\alpha$ for $\theta \le \theta_0$ versus 
$\theta > \theta_0$. 

By property 2, $\varphi$ is most powerful at level $\alpha_1 = E_{\theta_1}\varphi$ for 
$\theta = \theta_1$ versus $\theta = \theta_2$. But the test $\varphi'(x) \equiv \alpha_1$ is also 
at level $\alpha_1$; hence $E_{\theta_2}\varphi' \le E_{\theta_2}\varphi$, that is, 
$\alpha_1 = E_{\theta_1}\varphi \le E_{\theta_2}\varphi$. 


REMARK. Since the Neyman-Pearson lemma is symmetric in $H_0$ and $H_1$, if 
$\theta_1 < \theta_2$, then for all tests $\varphi'$ with $\beta(\varphi') \le \beta(\varphi)$, we 
have $E_{\theta_1}\varphi \le E_{\theta_1}\varphi'$. We might say that $\varphi$ is uniformly 
least powerful for $\theta < \theta_0$ among all tests whose type 2 error is 
$\le \beta(\varphi)$ whenever $\theta > \theta_0$. ◀ 


PROBLEMS 


1. Let $H_0: f_0(x) = e^{-x}$, $x \ge 0$; $H_1: f_1(x) = 2e^{-2x}$, $x \ge 0$. 

(a) Find the risk set and the admissible risk points. 
(b) Find a most powerful test at level .05. 
(c) Find a minimax test. 

2. Show that the following families have the MLR property, and thus UMP 
tests may be constructed as in the discussion of Example 3. 

(a) $p_\theta$ = the joint probability function of n independent random variables, 
each Poisson with parameter $\theta$. 

(b) $p_\theta$ = the joint probability function of n independent random variables 
$R_i$, where $R_i$ is Bernoulli with parameter $\theta$; that is, $P\{R_i = 1\} = \theta$, 
$P\{R_i = 0\} = 1 - \theta$, $0 \le \theta \le 1$; notice that $R_1 + \cdots + R_n$ has the 
binomial distribution with parameters n and $\theta$. 

(c) Suppose that of N objects, $\theta$ are defective. If n objects are drawn without 
replacement, the probability that exactly x defective objects will be found 
in the sample is 

\[ p_\theta(x) = \frac{\binom{\theta}{x}\binom{N - \theta}{n - x}}{\binom{N}{n}}, \quad x = 0, 1, \ldots, \theta \quad (\theta = 0, 1, \ldots, N) \]

This is the hypergeometric probability function; see Problem 7, Section 1.5. 

(d) $f_\theta$ = the joint density of n independent normally distributed random 
variables with mean 0 and variance $\theta > 0$. 

3. It is desired to test the null hypothesis that a die is unbiased versus the 
alternative that the die is loaded, with faces 1 and 2 having probability 1/4 and 
faces 3, 4, 5, and 6 having probability 1/8. 

(a) Sketch the set of admissible risk points. 

(b) Find a most powerful test at level .1. 

(c) Find a Bayes solution if the cost of a type 1 error is $c_1$, the cost of a type 
2 error is $2c_1$, and the null hypothesis has probability 3/4. 

4. It is desired to test the null hypothesis that R is normal with mean $\theta_0$ and 
known variance $\sigma^2$ versus the alternative that R is normal with mean 
$\theta_1 = \theta_0 + \sigma$ and variance $\sigma^2$, on the basis of n independent 
observations of R. Find the minimum value of n such that $\alpha \le .05$ and 
$\beta \le .03$. 

5. Consider the problem of testing the null hypothesis that R is normal 
$(0, \theta_0)$ versus the alternative that R is normal $(0, \theta_1)$, 
$\theta_1 > \theta_0$ (notice that in this case a UMP test of $\theta \le \theta_0$ versus 
$\theta > \theta_0$ exists; see Problem 2d). Describe a most powerful test at level 
$\alpha$ and indicate how to find the minimum number of independent observations 
of R necessary to reduce the probability of a type 2 error below a given figure. 


6. Let $R_1, \ldots, R_n$ be independent random variables, each uniformly
distributed between 0 and $\theta$, $\theta > 0$. Show that the following test
is UMP at level $\alpha$ for $H_0$: $\theta = \theta_0$ versus
$H_1$: $\theta \ne \theta_0$.

$$\varphi(x_1, \ldots, x_n) = 1 \quad \text{if } \max_{1 \le i \le n} x_i \le \theta_0\,\alpha^{1/n} \text{ or if } \max_{1 \le i \le n} x_i > \theta_0$$

$$\varphi(x_1, \ldots, x_n) = 0 \quad \text{otherwise}$$

Find and sketch the power function of the test.


7. Consider the test of Problem 6 with $H_0$: $\theta = 1$; $H_1$: $\theta = 2$.
Find the risk set and the set of admissible risk points.


8. Let $R_1$, $R_2$, and $R_3$ be independent, each Bernoulli with parameter
$\theta$, $0 < \theta < 1$. Find the UMP test of size $\alpha = .1$ of
$\theta \le 1/4$ versus $\theta > 1/4$, and find the power function of the test.


9. Show that every admissible test is a Bayes test for some choice of costs
$c_1$ and $c_2$ and a priori probability $p$. Conversely, show that every Bayes
test with $c_1 > 0$, $c_2 > 0$, $0 < p < 1$ is admissible. Give an example of an
inadmissible Bayes test with $c_1 > 0$, $c_2 > 0$.


10. If $\varphi$ is most powerful at level $\alpha_0$ and $\beta(\varphi) > 0$,
show that $\varphi$ is actually of size $\alpha_0$. Give a counterexample to the
assertion if $\beta(\varphi) = 0$.


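*11. Let $\varphi$ be a most powerful test at level $\alpha$. Show that for some
constant $\lambda$ we have $\varphi(x) = 1$ if $L(x) > \lambda$;
$\varphi(x) = 0$ if $L(x) < \lambda$, where $L$ is the likelihood ratio, except
possibly for $x$ in a set of Lebesgue measure 0.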


12. A class $C$ of tests is said to be essentially complete iff for any test
$\varphi_1$ there is a test $\varphi_2 \in C$ such that $\varphi_2$ is as good as
$\varphi_1$. Show that the following classes are essentially complete.

(a) The likelihood ratio tests.

(b) The admissible tests.

(c) The Bayes tests (i.e., considering all possible $c_1$, $c_2$, and $p$).


13. Give an example of tests $\varphi_1$ and $\varphi_2$ such that the statements
"$\varphi_1$ is as good as $\varphi_2$" and "$\varphi_2$ is as good as
$\varphi_1$" are both false.


14. Let $R_1, R_2, \ldots$ be independent random variables, each with density
$h_\theta$, and let $H_0$: $\theta = \theta_0$, $H_1$: $\theta = \theta_1$.

(a) If $\varphi_n$ is a test based on $n$ observations that minimizes the sum of
the error probabilities, show that $\varphi_n(x) = 1$ if
$g_n(x) = \prod_{i=1}^n [h_{\theta_1}(x_i)/h_{\theta_0}(x_i)] > 1$, and
$\varphi_n(x) = 0$ if $g_n(x) < 1$. Thus

$$\alpha_n + \beta_n = P_{\theta_0}\{x : g_n(x) > 1\} + P_{\theta_1}\Big\{x : \frac{1}{g_n(x)} \ge 1\Big\}$$

(b) Let $t(x_i) = [h_{\theta_1}(x_i)/h_{\theta_0}(x_i)]^{1/2}$. Show that

$$P_{\theta_0}\{x : g_n(x) > 1\} \le \prod_{i=1}^n E_{\theta_0}\,t(R_i) = [E_{\theta_0}\,t(R_1)]^n$$

(c) Show that $E_{\theta_0}\,t(R_1) < 1$; hence $\alpha_n \to 0$ as
$n \to \infty$. A similar argument with $\theta_0$ and $\theta_1$ interchanged
shows that $\beta_n \to 0$ as $n \to \infty$, so that if enough observations are
taken, both error probabilities can be made arbitrarily small.


8.3 ESTIMATION 


Consider the statistical decision model of Section 8.1. Suppose that $\gamma$ is
a real-valued function on the set $N$ of states of nature, and we wish to
estimate $\gamma(\theta)$. If we observe $R = x$ we must produce a number
$\psi(x)$ that we hope will be close to $\gamma(\theta)$. Thus the action space
$A$ is the set of reals $E^1$, and a decision function may be specified by
giving a (Borel measurable) function $\psi$ from the range of $R$ to $E^1$; such
a $\psi$ is called an estimate, and the above decision problem is called a
problem of point estimation of a real parameter.

Although the estimate $\psi$ appears intrinsically nonrandomized, it is possible
to introduce randomization without an essential change in the model. If $R_1$
is the observable, we let $R_2$ be a random variable independent of $R_1$ and
$\theta$, with an arbitrary distribution function $F$. Formally, assume
$P_\theta\{R_1 \in B_1, R_2 \in B_2\} = P_\theta\{R_1 \in B_1\}P\{R_2 \in B_2\}$,
where $P\{R_2 \in B_2\}$ is determined by the distribution function $F$ and is
unaffected by $\theta$. If $R_1 = x$ and $R_2 = y$, we estimate $\gamma(\theta)$
by a number $\psi(x, y)$. Thus we introduce randomization by enlarging the
observable.

There is no unique way of specifying a good estimate; we shall discuss
several classes of estimates that have desirable properties.

We first consider maximum likelihood estimates. Let $f_\theta$ be the density
(or probability) function corresponding to the state of nature $\theta$, and
assume for simplicity that $\gamma(\theta) = \theta$. If $R = x$, the maximum
likelihood estimate of $\theta$ is given by $\psi(x) = \hat\theta$ = the value
of $\theta$ that maximizes $f_\theta(x)$. Thus (at least in the discrete case)
the estimate is the state of nature that makes the particular observation most
likely. In many cases the maximum likelihood estimate is easily computable.


▶ Example 1. Let $R$ have the binomial distribution with parameters $n$ and
$\theta$, $0 < \theta < 1$, so that
$p_\theta(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$, $x = 0, 1, \ldots, n$.
To find $\hat\theta$ we may set

$$\frac{\partial}{\partial\theta}\ln p_\theta(x) = 0$$

to obtain

$$\frac{x}{\theta} - \frac{n-x}{1-\theta} = 0, \qquad \text{or} \qquad \hat\theta = \frac{x}{n}$$

Notice that $R$ may be regarded as a sum of independent random variables
$R_1, \ldots, R_n$, where $R_i$ is 1 with probability $\theta$ and 0 with
probability $1 - \theta$. In terms of the $R_i$ we have
$\hat\theta(R) = (R_1 + \cdots + R_n)/n$, which converges in probability to
$E(R_i) = \theta$ by the weak law of large numbers. Convergence in probability
of the maximum likelihood estimate to the true parameter can be established
under rather general conditions. ◀
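As a numerical illustration of Example 1, the following minimal sketch maximizes the binomial log-likelihood over a grid of $\theta$ values and compares the result with the closed form $x/n$. The function names, the grid size, and the sample values $x = 7$, $n = 20$ are hypothetical choices made only for this illustration.

```python
import math

def binomial_log_likelihood(theta, x, n):
    """ln p_theta(x) for the binomial distribution with parameters n and theta."""
    return (math.log(math.comb(n, x))
            + x * math.log(theta)
            + (n - x) * math.log(1.0 - theta))

def grid_mle(x, n, steps=10_000):
    """Maximize the log-likelihood over the interior grid k/steps, 0 < k < steps."""
    grid = [k / steps for k in range(1, steps)]  # excludes theta = 0 and theta = 1
    return max(grid, key=lambda th: binomial_log_likelihood(th, x, n))

if __name__ == "__main__":
    x, n = 7, 20  # illustrative observation and sample size
    print(grid_mle(x, n), x / n)
```

The grid search is deliberately crude; it only confirms that the maximizing grid point coincides with $x/n$ when $x/n$ lies on the grid.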


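▶ Example 2. Let $R_1, \ldots, R_n$ be independent, normally distributed random
variables with mean $\mu$ and variance $\sigma^2$. Find the maximum likelihood
estimate of $\theta = (\mu, \sigma^2)$. (Here $\theta$ is a point in $E^2$ rather
than a real number, but the maximum likelihood estimate is defined as before.)

We have

$$f_\theta(x) = (2\pi\sigma^2)^{-n/2}\exp\Big[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big]$$

so that

$$\ln f_\theta(x) = -\frac{n}{2}\ln 2\pi - n\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$

Thus

$$\frac{\partial}{\partial\mu}\ln f_\theta(x) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = \frac{n}{\sigma^2}(\bar x - \mu)$$

where

$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i$$

and

$$\frac{\partial}{\partial\sigma}\ln f_\theta(x) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i - \mu)^2 = \frac{n}{\sigma^3}\Big(-\sigma^2 + \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2\Big)$$

Setting the partial derivatives equal to zero, we obtain

$$\hat\theta(x_1, \ldots, x_n) = (\bar x, s^2)$$

where

$$s^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$$

(A standard calculus argument shows that this is actually a maximum.) In
terms of the $R_i$, we have

$$\hat\theta(R_1, \ldots, R_n) = (\bar R, V^2)$$

where $\bar R$ is the sample mean $(R_1 + \cdots + R_n)/n$ and $V^2$ is the
sample variance $(1/n)\sum_{i=1}^n (R_i - \bar R)^2$.

If the problem is changed so that $\theta = \mu$ (i.e., $\sigma^2$ is known), we
obtain $\hat\theta = \bar R$ as above. However, if $\theta = \sigma^2$ (i.e.,
$\mu$ is known), then we find $\hat\theta = (1/n)\sum_{i=1}^n (x_i - \mu)^2$,
since the equation $\partial \ln f_\theta(x)/\partial\mu = 0$ is no longer
present. ◀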
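The closed form of Example 2 can be checked numerically. The sketch below, under the assumption of a small made-up data set, computes $(\bar x, s^2)$ and confirms that perturbing either coordinate lowers the log-likelihood; the function names are hypothetical.

```python
import math

def log_likelihood(xs, mu, sigma):
    """ln f_theta(x) for n independent normal observations with mean mu, std sigma."""
    n = len(xs)
    return (-(n / 2) * math.log(2 * math.pi)
            - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

def normal_mle(xs):
    """Return (sample mean, sample variance with divisor n)."""
    n = len(xs)
    x_bar = sum(xs) / n
    s2 = sum((x - x_bar) ** 2 for x in xs) / n
    return x_bar, s2

if __name__ == "__main__":
    data = [1.0, 2.0, 4.0, 7.0]  # illustrative observations
    print(normal_mle(data))
```

Note that the variance estimate uses the divisor $n$, not $n - 1$, in agreement with the definition of $s^2$ in the example.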
We now discuss Bayes estimates. For the sake of definiteness we consider
the absolutely continuous case. Assume $N = E^1$, and let $f_\theta$ be the
density of $R$ when the state of nature is $\theta$. Assume that there is an a
priori density $g$ for $\theta$; that is, the probability that the state of
nature will lie in the set $B$ is given by $\int_B g(\theta)\,d\theta$. Finally,
assume that we are given a (nonnegative) loss function $L(\gamma(\theta), a)$,
$\theta \in N$, $a \in A$; $L(\gamma(\theta), a)$ is the cost when our estimate
of $\gamma(\theta)$ turns out to be $a$. If $\psi$ is an estimate, the over-all
average cost associated with $\psi$ is

$$B(\psi) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(\theta)f_\theta(x)L(\gamma(\theta), \psi(x))\,d\theta\,dx$$

$B(\psi)$ is called the Bayes risk of $\psi$, and an estimate that minimizes
$B(\psi)$ is called a Bayes estimate. If we write

$$B(\psi) = \int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} g(\theta)f_\theta(x)L(\gamma(\theta), \psi(x))\,d\theta\Big]dx \qquad (8.3.1)$$

it follows that in order to minimize $B(\psi)$ it is sufficient to minimize the
expression in brackets for each $x$.

Often this is computationally feasible. In particular, let
$L(\gamma(\theta), a) = (\gamma(\theta) - a)^2$. Thus we are trying to minimize

$$\int_{-\infty}^{\infty} g(\theta)f_\theta(x)(\gamma(\theta) - \psi(x))^2\,d\theta$$

This is of the form $A\psi^2(x) - 2B\psi(x) + C$, which is a minimum when
$\psi(x) = B/A$; that is,

$$\psi(x) = \frac{\int_{-\infty}^{\infty} g(\theta)f_\theta(x)\gamma(\theta)\,d\theta}{\int_{-\infty}^{\infty} g(\theta)f_\theta(x)\,d\theta} \qquad (8.3.2)$$

But the conditional density of $\theta$ given $R = x$ is
$g(\theta)f_\theta(x)/\int_{-\infty}^{\infty} g(\theta)f_\theta(x)\,d\theta$,
so that $\psi(x)$ is simply the conditional expectation of $\gamma(\theta)$
given $R = x$.

To summarize: To find a Bayes estimate with quadratic loss function, set
$\psi(x)$ = the conditional expectation of the parameter to be estimated, given
that the observable takes the value $x$.


▶ Example 3. Let $R$ have the binomial distribution with parameters $n$ and
$\theta$, $0 \le \theta \le 1$, and let $\gamma(\theta) = \theta$. Take $g$ as
the beta density with parameters $r$ and $s$; that is,

$$g(\theta) = \frac{\theta^{r-1}(1-\theta)^{s-1}}{\beta(r, s)}, \qquad 0 \le \theta \le 1,\ r, s > 0$$

where $\beta(r, s)$ is the beta function (see Section 2 of Chapter 4). First we
find a Bayes estimate of $\theta$ with quadratic loss function.

The discussion leading to (8.3.2) applies, with $f_\theta(x)$ replaced by
$p_\theta(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$, $x = 0, 1, \ldots, n$. Thus

$$\psi(x) = \frac{\int_0^1 \binom{n}{x}\theta^{r+x}(1-\theta)^{n-x+s-1}\,d\theta}{\int_0^1 \binom{n}{x}\theta^{r+x-1}(1-\theta)^{n-x+s-1}\,d\theta} = \frac{\beta(r+x+1,\ n-x+s)}{\beta(r+x,\ n-x+s)}$$

$$= \frac{\Gamma(r+x+1)\,\Gamma(n-x+s)}{\Gamma(r+x)\,\Gamma(n-x+s)} \cdot \frac{\Gamma(r+s+n)}{\Gamma(r+s+n+1)} = \frac{r+x}{r+s+n}$$

Now, for a given $\theta$, the average loss $\rho_\psi(\theta)$, using $\psi$,
may be computed as follows.

$$\rho_\psi(\theta) = E_\theta\Big[\Big(\frac{r+R}{r+s+n} - \theta\Big)^2\Big] = \frac{1}{(r+s+n)^2}E_\theta[(R - n\theta + r - r\theta - s\theta)^2]$$

Since $E_\theta[(R - n\theta)^2] = \operatorname{Var}_\theta R = n\theta(1-\theta)$
and $E_\theta R = n\theta$, we have

$$\rho_\psi(\theta) = \frac{1}{(r+s+n)^2}[n\theta(1-\theta) + (r - r\theta - s\theta)^2]$$

$$= \frac{1}{(r+s+n)^2}[((r+s)^2 - n)\theta^2 + (n - 2r(r+s))\theta + r^2]$$

$\rho_\psi$ is called the risk function of $\psi$; notice that

$$B(\psi) = \int_{-\infty}^{\infty} g(\theta)\rho_\psi(\theta)\,d\theta \qquad (8.3.3)$$

It is possible to choose $r$ and $s$ so that $\rho_\psi$ will be constant for
all $\theta$. For this to happen,

$$n = (r+s)^2 = 2r(r+s)$$

which is satisfied if $r = s = \sqrt{n}/2$. We then have

$$\rho_\psi(\theta) = \frac{n/4}{(n+\sqrt{n})^2} = \frac{1}{4(1+\sqrt{n})^2}$$

Thus in this case $\psi$ is a Bayes estimate with constant risk; we claim that
$\psi$ must be minimax, that is, $\psi$ minimizes $\max_\theta \rho_\psi(\theta)$.
For if $\psi'$ had a maximum risk smaller than that of $\psi$, (8.3.3) shows that
$B(\psi') < B(\psi)$, contradicting the fact that $\psi$ is Bayes.

Notice that if $\hat\theta(x) = x/n$ is the maximum likelihood estimate, then
$\psi(x) = a_n\hat\theta(x) + b_n$, where $a_n \to 1$, $b_n \to 0$ as
$n \to \infty$. ◀
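The two conclusions of Example 3 lend themselves to a numerical check. The sketch below is illustrative only: it evaluates the ratio (8.3.2) by a midpoint rule and compares it with $(r+x)/(r+s+n)$, then computes the risk function exactly as a finite sum and confirms that it is constant when $r = s = \sqrt{n}/2$. The function names, the step count, and the choice $n = 16$ are assumptions made for the illustration.

```python
import math

def bayes_estimate(x, n, r, s):
    """Closed form psi(x) = (r + x) / (r + s + n) for the beta(r, s) prior."""
    return (r + x) / (r + s + n)

def posterior_mean_numeric(x, n, r, s, steps=20_000):
    """Ratio (8.3.2) by the midpoint rule; common constant factors cancel."""
    h = 1.0 / steps
    num = den = 0.0
    for k in range(steps):
        th = (k + 0.5) * h
        w = th ** (r + x - 1) * (1 - th) ** (s + n - x - 1)
        num += th * w
        den += w
    return num / den

def risk(theta, n, r, s):
    """rho_psi(theta) = E_theta[(psi(R) - theta)^2], summed over the binomial pmf."""
    return sum(math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)
               * (bayes_estimate(x, n, r, s) - theta) ** 2
               for x in range(n + 1))

if __name__ == "__main__":
    n = 16
    r = s = math.sqrt(n) / 2  # the constant-risk choice
    print([round(risk(t, n, r, s), 6) for t in (0.1, 0.5, 0.9)])
    print(1 / (4 * (1 + math.sqrt(n)) ** 2))
```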


We have not yet discussed randomized estimates; in fact, in a wide variety
of situations, including the case of quadratic loss functions, randomization
can be ignored. In order to justify this, we first consider a basic theorem
concerning convex functions.

A function $f$ from the reals to the reals is said to be convex iff
$f[(1 - a)x + ay] \le (1 - a)f(x) + af(y)$ for all real $x$, $y$ and all
$a \in [0, 1]$. A sufficient condition for $f$ to be convex is that it have a
nonnegative second derivative ("concave upward" is the phrase used in calculus
books). The geometric interpretation is that $f$ lies on or above any of its
tangents.


Theorem 1 (Jensen's Inequality). If $R$ is a random variable, $f$ is a convex
function, and $E(R)$ is finite, then $E[f(R)] \ge f[E(R)]$. (For example,
$E[R^{2n}] \ge [E(R)]^{2n}$, $n = 1, 2, \ldots$.)


PROOF. Consider a tangent to $f$ at the point $E(R)$ (see Figure 8.3.1); let
the equation of the tangent be $y = ax + b$. Since $f$ is convex,
$f(x) \ge ax + b$ for all $x$; hence $f(R) \ge aR + b$. Thus
$E[f(R)] \ge aE(R) + b = f(E(R))$.


[FIGURE 8.3.1 Proof of Jensen's Inequality. The graph of $f(x)$ is shown with
its tangent line $ax + b$ at the point $E(R)$.]

We may now prove the theorem that allows us to ignore randomized
estimates.


Theorem 2 (Rao-Blackwell). Let $R_1$ be an observable, and let $R_2$ be
independent of $R_1$ and $\theta$, as indicated in the discussion of randomized
estimates at the beginning of this section. Let $\psi = \psi(x, y)$ be any
estimate of $\gamma(\theta)$ based on observation of $R_1$ and $R_2$. Assume
that the loss function $L(\gamma(\theta), a)$ is a convex function of $a$ for
each $\theta$ (this includes the case of quadratic loss). Define

$$\psi^*(x) = E_\theta[\psi(R_1, R_2) \mid R_1 = x]$$

($E_\theta\psi(R_1, R_2)$ is assumed finite.)

Let $\rho_\psi$ be the risk function of $\psi$, defined by
$\rho_\psi(\theta) = E_\theta[L(\gamma(\theta), \psi(R_1, R_2))]$ = the average
loss, using $\psi$, when the state of nature is $\theta$. Similarly, let
$\rho_{\psi^*}(\theta) = E_\theta[L(\gamma(\theta), \psi^*(R_1))]$. Then
$\rho_{\psi^*}(\theta) \le \rho_\psi(\theta)$ for all $\theta$; hence the
nonrandomized estimate $\psi^*$ is at least as good as the randomized estimate
$\psi$.


PROOF.

$$L(\gamma(\theta), E_\theta[\psi(R_1, R_2) \mid R_1 = x]) \le E_\theta[L(\gamma(\theta), \psi(R_1, R_2)) \mid R_1 = x]$$

by the argument of Jensen's inequality applied to conditional expectations.
Therefore

$$L(\gamma(\theta), \psi^*(R_1)) \le E_\theta[L(\gamma(\theta), \psi(R_1, R_2)) \mid R_1]$$

Take expectations on both sides to obtain

$$\rho_{\psi^*}(\theta) \le E_\theta[L(\gamma(\theta), \psi(R_1, R_2))] = \rho_\psi(\theta)$$

as desired.
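The content of the Rao-Blackwell theorem can be illustrated on a toy model. In the sketch below, everything specific is a hypothetical choice: $R_1$ is binomial, the auxiliary variable $R_2$ takes the values $\pm 0.1$ with probability 1/2 each, and the randomized estimate is $\psi(x, y) = x/n + y$. Averaging out $R_2$ gives $\psi^*(x) = x/n$, and the quadratic risks are computed exactly as finite sums.

```python
import math

NOISE = {-0.1: 0.5, 0.1: 0.5}  # hypothetical distribution of the auxiliary R2

def binom_pmf(x, n, theta):
    return math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)

def psi(x, y, n):
    """Randomized estimate based on (R1, R2)."""
    return x / n + y

def psi_star(x, n):
    """psi*(x) = E[psi(R1, R2) | R1 = x]; R2 is independent of R1 and theta."""
    return sum(p * psi(x, y, n) for y, p in NOISE.items())

def risk_randomized(theta, n):
    return sum(binom_pmf(x, n, theta) * p * (psi(x, y, n) - theta) ** 2
               for x in range(n + 1) for y, p in NOISE.items())

def risk_star(theta, n):
    return sum(binom_pmf(x, n, theta) * (psi_star(x, n) - theta) ** 2
               for x in range(n + 1))

if __name__ == "__main__":
    for theta in (0.2, 0.5, 0.8):
        print(theta, risk_star(theta, 10), risk_randomized(theta, 10))
```

In this toy case the gap between the two risks equals the variance of the added noise, 0.01, for every $\theta$.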

PROBLEMS

1. Let $R_1, \ldots, R_n$ be independent random variables, all having the same
density $h_\theta$; thus
$f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^n h_\theta(x_i)$. In each case find
the maximum likelihood estimate of $\theta$.

(a) $h_\theta(x) = \theta x^{\theta - 1}$, $0 \le x \le 1$, $\theta > 0$

(b) $h_\theta(x) = \dfrac{1}{\theta}e^{-x/\theta}$, $x \ge 0$, $\theta > 0$

(c) $h_\theta(x) = \dfrac{1}{\theta}$, $0 \le x \le \theta$, $\theta > 0$

2. Let $R$ have the Cauchy density with parameter $\theta$; that is,

$$f_\theta(x) = \frac{\theta}{\pi(x^2 + \theta^2)}, \qquad \theta > 0$$

Find the maximum likelihood estimate of $\theta$.

3. Let $R$ have the negative binomial distribution; that is (see Problem 6,
Section 6.4),

$$p_\theta(x) = P\{R = x\} = \binom{x-1}{r-1}\theta^r(1-\theta)^{x-r}, \qquad x = r, r+1, \ldots,\ 0 < \theta \le 1$$

Find the maximum likelihood estimate of $\theta$.

4. Find the risk function in Example 3, using the maximum likelihood estimate
$\hat\theta = x/n$.

5. In Example 3, find the Bayes estimate if $\theta$ is uniformly distributed
between 0 and 1.

6. In Example 3, change the loss function to
$L(\theta, a) = (\theta - a)^2/\theta(1 - \theta)$, and let $\theta$ be
uniformly distributed between 0 and 1. Find the Bayes estimate and show that it
has constant risk and is therefore minimax.

7. Let $R$ have the Poisson distribution with parameter $\theta > 0$. Find the
Bayes estimate $\psi$ of $\theta$ with quadratic loss function if the a priori
density is $g(\theta) = e^{-\theta}$. Compute the risk function and the Bayes
risk using $\psi$, and compare with the results using the maximum likelihood
estimate.


8.4 SUFFICIENT STATISTICS 


In many situations the statistician is concerned with reduction of data. For
example, if a sequence of observations results in numbers $x_1, \ldots, x_n$, it
is easier to store the single number $x_1 + \cdots + x_n$ than to record the
entire set of observations. Under certain conditions no essential information
is lost in reducing the data; let us illustrate this by an example.

Let $R_1, \ldots, R_n$ be independent, Bernoulli random variables with
parameter $\theta$; that is, $P\{R_i = 1\} = \theta$,
$P\{R_i = 0\} = 1 - \theta$, $0 < \theta < 1$. Let
$T = t(R_1, \ldots, R_n) = R_1 + \cdots + R_n$, which has the binomial
distribution with parameters $n$ and $\theta$. We claim that
$P_\theta\{R_1 = x_1, \ldots, R_n = x_n \mid T = y\}$ actually does not depend
on $\theta$. We compute, for $x_i = 0$ or 1, $i = 1, \ldots, n$,

$$P_\theta\{R_1 = x_1, \ldots, R_n = x_n \mid T = y\} = \frac{P_\theta\{R_1 = x_1, \ldots, R_n = x_n, T = y\}}{P_\theta\{T = y\}}$$

This is 0 unless $y = x_1 + \cdots + x_n$, in which case we obtain

$$\frac{P_\theta\{R_1 = x_1, \ldots, R_n = x_n\}}{P_\theta\{T = y\}} = \frac{\theta^y(1-\theta)^{n-y}}{\binom{n}{y}\theta^y(1-\theta)^{n-y}} = \frac{1}{\binom{n}{y}}$$

The significance of this result is that for the purpose of making a statistical
decision based on observation of $R_1, \ldots, R_n$, we may ignore the
individual $R_i$ and base the decision entirely on $R_1 + \cdots + R_n$. To
justify this, consider two statisticians, A and B. Statistician A observes
$R_1, \ldots, R_n$ and then makes his decision. Statistician B, on the other
hand, is only given $T = R_1 + \cdots + R_n$. He then constructs random
variables $R_1', \ldots, R_n'$ as follows. If $T = y$, let $R_1', \ldots, R_n'$
be chosen according to the conditional probability function of
$R_1, \ldots, R_n$ given $T = y$. Explicitly,

$$P\{R_1' = x_1, \ldots, R_n' = x_n \mid T = y\} = \frac{1}{\binom{n}{y}}$$

where $x_i = 0$ or 1, $i = 1, \ldots, n$, $x_1 + \cdots + x_n = y$. B then
follows A's decision procedure, using $R_1', \ldots, R_n'$. Note that since the
conditional probability function of $R_1, \ldots, R_n$ given $T = y$ does not
depend on the unknown parameter $\theta$, B's procedure is sensible. Now if
$x_1 + \cdots + x_n = y$,

$$P_\theta\{R_1' = x_1, \ldots, R_n' = x_n\} = P_\theta\{T = y, R_1' = x_1, \ldots, R_n' = x_n\}$$

$$= P_\theta\{T = y\}\,P\{R_1' = x_1, \ldots, R_n' = x_n \mid T = y\}$$

$$= \binom{n}{y}\theta^y(1-\theta)^{n-y}\cdot\frac{1}{\binom{n}{y}} = \theta^y(1-\theta)^{n-y}$$

$$= P_\theta\{R_1 = x_1, \ldots, R_n = x_n\}$$

Thus $(R_1', \ldots, R_n')$ has exactly the same probability function as
$(R_1, \ldots, R_n)$, so that the procedures of A and B are equivalent. In other
words, anything A can do, B can do at least as well, even though B starts with
less information.
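The claim that the conditional probability function of the Bernoulli observations given their sum does not depend on $\theta$ can be checked by brute-force enumeration. The following is a minimal sketch; the function name and the small values of $n$ and $y$ are hypothetical choices for illustration.

```python
from itertools import product
from math import comb

def conditional_pmf(theta, n, y):
    """P_theta{R_1 = x_1, ..., R_n = x_n | T = y} for every 0/1 vector with sum y."""
    joint = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
             for x in product((0, 1), repeat=n)}
    p_t = sum(p for x, p in joint.items() if sum(x) == y)  # P_theta{T = y}
    return {x: p / p_t for x, p in joint.items() if sum(x) == y}

if __name__ == "__main__":
    for theta in (0.2, 0.7):
        values = set(round(p, 12) for p in conditional_pmf(theta, 4, 2).values())
        print(theta, values, 1 / comb(4, 2))
```

For each $\theta$ the conditional probabilities all equal $1/\binom{n}{y}$, which is what allows statistician B to reconstruct a sample without knowing $\theta$.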

We now give the formal definitions. For simplicity, we restrict ourselves to 
the discrete case. However, the definition of sufficiency in the absolutely 
continuous case is the same, with probability functions replaced by densities. 
Also, the basic factorization theorem, to be proved below, holds in the 
absolutely continuous case (admittedly with a more difficult proof). 

Let $R$ be a discrete random variable (or random vector) whose probability
function under the state of nature $\theta$ is $p_\theta$. Let $T$ be a
statistic for $R$, that is, a function of $R$ that is also a random variable.
$T$ is said to be sufficient for $R$ (or for the family $p_\theta$,
$\theta \in N$) iff the conditional probability function of $R$ given $T$ does
not depend on $\theta$.

The definition is often unwieldy, and the following criterion for sufficiency
is useful.

Theorem 1 (Factorization Theorem). Let $T = t(R)$ be a statistic for $R$. $T$ is
sufficient for $R$ if and only if the probability function $p_\theta$ can be
factored in the form $p_\theta(x) = g(\theta, t(x))h(x)$.


PROOF. Assume a factorization of this form. Then

$$P_\theta\{R = x \mid T = y\} = \frac{P_\theta\{R = x, T = y\}}{P_\theta\{T = y\}}$$

This is 0 unless $t(x) = y$, in which case we obtain

$$\frac{P_\theta\{R = x\}}{P_\theta\{T = y\}} = \frac{g(\theta, t(x))h(x)}{\sum_{\{z:\,t(z)=y\}} g(\theta, t(z))h(z)} = \frac{g(\theta, y)h(x)}{\sum_{\{z:\,t(z)=y\}} g(\theta, y)h(z)} = \frac{h(x)}{\sum_{\{z:\,t(z)=y\}} h(z)}$$

which is free of $\theta$.

Conversely, if $T$ is sufficient, then

$$p_\theta(x) = P_\theta\{R = x\} = P_\theta\{R = x, T = t(x)\}$$

$$= P_\theta\{T = t(x)\}\,P_\theta\{R = x \mid T = t(x)\}$$

$$= g(\theta, t(x))h(x) \quad \text{by definition of sufficiency}$$


▶ Example 1. Let $R_1, \ldots, R_n$ be independent, each Bernoulli with
parameter $\theta$. Show that $R_1 + \cdots + R_n$ is sufficient for
$(R_1, \ldots, R_n)$.

We have done this in the introductory discussion, using the definition of
sufficiency. Let us check the result using the factorization theorem. If
$x = (x_1, \ldots, x_n)$, $x_i = 0, 1$, and $t(x) = x_1 + \cdots + x_n$, then

$$p_\theta(x_1, \ldots, x_n) = \theta^{t(x)}(1 - \theta)^{n - t(x)}$$

which is of the form specified in the factorization theorem [with $h(x) = 1$]. ◀


▶ Example 2. Let $R_1, \ldots, R_n$ be independent, each Poisson with
parameter $\theta$. Again $R_1 + \cdots + R_n$ is sufficient for
$(R_1, \ldots, R_n)$. (Notice that $R_1 + \cdots + R_n$ is Poisson with
parameter $n\theta$.)

For

$$p_\theta(x_1, \ldots, x_n) = P_\theta\{R_1 = x_1, \ldots, R_n = x_n\}, \qquad x_i = 0, 1, \ldots$$

$$= \prod_{i=1}^n P_\theta\{R_i = x_i\} = \frac{e^{-n\theta}\theta^{x_1 + \cdots + x_n}}{x_1! \cdots x_n!}$$

The factorization theorem applies, with
$g(\theta, t(x)) = e^{-n\theta}\theta^{t(x)}$, $h(x) = 1/(x_1! \cdots x_n!)$,
$t(x) = x_1 + \cdots + x_n$. ◀


▶ Example 3. Let $R_1, \ldots, R_n$ be independent, each normally distributed
with mean $\mu$ and variance $\sigma^2$. Find a sufficient statistic for
$(R_1, \ldots, R_n)$ assuming

(a) $\mu$ and $\sigma^2$ both unknown; that is, $\theta = (\mu, \sigma^2)$.

(b) $\sigma^2$ known; that is, $\theta = \mu$.

(c) $\mu$ known; that is, $\theta = \sigma^2$.

[Of course $(R_1, \ldots, R_n)$ is always sufficient for itself, but we hope to
reduce the data a bit more.] We compute

$$f_\theta(x) = (2\pi\sigma^2)^{-n/2}\exp\Big[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big] \qquad (8.4.1)$$

Let

$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$$

Since $x_i - \bar x = x_i - \mu - (\bar x - \mu)$, we have

$$s^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 - (\bar x - \mu)^2$$

Thus

$$f_\theta(x) = (2\pi\sigma^2)^{-n/2}\,e^{-ns^2/2\sigma^2}\,e^{-n(\bar x - \mu)^2/2\sigma^2} \qquad (8.4.2)$$

By (8.4.2), if $\mu$ and $\sigma^2$ are unknown, then [take $h(x) = 1$]
$(\bar R, V^2)$ is sufficient, where $\bar R$ is the sample mean
$(1/n)\sum_{i=1}^n R_i$ and $V^2$ is the sample variance
$(1/n)\sum_{i=1}^n (R_i - \bar R)^2$. If $\sigma^2$ is known, then the term
$(2\pi\sigma^2)^{-n/2}e^{-ns^2/2\sigma^2}$ can be taken as $h(x)$ in the
factorization theorem; hence $\bar R$ is sufficient. If $\mu$ is known, then,
by (8.4.1), $\sum_{i=1}^n (R_i - \mu)^2$ is sufficient. ◀


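PROBLEMS


1. Let $R_1, \ldots, R_n$ be independent, each uniformly distributed on the
interval $[\theta_1, \theta_2]$. Find a sufficient statistic for
$(R_1, \ldots, R_n)$, assuming
(a) $\theta_1$, $\theta_2$ both unknown
(b) $\theta_1$ known
(c) $\theta_2$ known

2. Repeat Problem 1 if each $R_i$ has the gamma density with parameters
$\theta_1$ and $\theta_2$, that is,

$$f_\theta(x) = \frac{x^{\theta_1 - 1}e^{-x/\theta_2}}{\Gamma(\theta_1)\theta_2^{\theta_1}}, \qquad x > 0,\ \theta_1, \theta_2 > 0$$

3. Repeat Problem 1 if each $R_i$ has the beta density with parameters
$\theta_1$ and $\theta_2$, that is,

$$f_\theta(x) = \frac{x^{\theta_1 - 1}(1 - x)^{\theta_2 - 1}}{\beta(\theta_1, \theta_2)}, \qquad 0 \le x \le 1,\ \theta_1, \theta_2 > 0$$

4. Let $R_1$ and $R_2$ be independent, with $R_1$ normal $(\theta, \sigma^2)$,
$R_2$ normal $(\theta, \tau^2)$, where $\sigma^2$ and $\tau^2$ are known. Show
that $R_1/\sigma^2 + R_2/\tau^2$ is sufficient for $(R_1, R_2)$.

5. An exponential family of densities is a family of the form

$$f_\theta(x) = a(\theta)b(x)\exp\Big[\sum_{i=1}^k c_i(\theta)t_i(x)\Big], \qquad x \text{ real},\ \theta \in N$$

(a) Verify that the following density (or probability) functions can be put
into the above form.

(i) Binomial $(n, \theta)$: $p_\theta(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$,
$x = 0, 1, \ldots, n$, $0 < \theta < 1$

(ii) Poisson $(\theta)$: $p_\theta(x) = \dfrac{e^{-\theta}\theta^x}{x!}$,
$x = 0, 1, \ldots$, $\theta > 0$

(iii) Normal $(\mu, \sigma^2)$:
$f_\theta(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}e^{-(x-\mu)^2/2\sigma^2}$,
$\theta = (\mu, \sigma)$

(iv) Gamma $(\theta_1, \theta_2)$:
$f_\theta(x) = \dfrac{x^{\theta_1 - 1}e^{-x/\theta_2}}{\Gamma(\theta_1)\theta_2^{\theta_1}}$,
$x > 0$, $\theta = (\theta_1, \theta_2)$, $\theta_1, \theta_2 > 0$

(v) Beta $(\theta_1, \theta_2)$:
$f_\theta(x) = \dfrac{x^{\theta_1 - 1}(1-x)^{\theta_2 - 1}}{\beta(\theta_1, \theta_2)}$,
$0 \le x \le 1$, $\theta = (\theta_1, \theta_2)$, $\theta_1, \theta_2 > 0$

(vi) Negative binomial $(r, \theta)$:
$p_\theta(x) = \binom{x-1}{r-1}\theta^r(1-\theta)^{x-r}$, $x = r, r+1, \ldots$,
$0 < \theta < 1$, $r$ a known positive integer

(b) If $R_1, \ldots, R_n$ are independent, each $R_i$ having the density
$f_\theta$ of part (a), find a sufficient statistic for $(R_1, \ldots, R_n)$.

6. Let $T$ be sufficient for the family of densities $f_\theta$,
$\theta \in N$. Consider the problem of testing the null hypothesis that
$\theta \in H_0$ versus the alternative that $\theta \in H_1$. Show that all
possible risk points can be obtained from tests based on $T$ [i.e.,
$\varphi(x)$ expressible as a function of $t(x)$].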


8.5 UNBIASED ESTIMATES BASED ON A COMPLETE 
SUFFICIENT STATISTIC 


In this section we require our estimates $\psi$ of $\gamma(\theta)$ to be
unbiased; that is, $E_\theta\psi(R) = \gamma(\theta)$ for all $\theta \in N$.
Our objective is to show that in a wide class of situations it is possible to
construct unbiased estimates $\psi$ that have uniformly minimum risk; that is,
if $\psi'$ is any unbiased estimate of $\gamma(\theta)$, then
$\rho_\psi(\theta) \le \rho_{\psi'}(\theta)$ for all $\theta$. We need a
technical definition first. If $T$ is a statistic for $R$, $T$ is said to be
complete iff there are no nontrivial unbiased estimates of 0 based on $T$, that
is, iff whenever $E_\theta g(T) = 0$ for all $\theta \in N$, we have
$P_\theta\{g(T) = 0\} = 1$ for all $\theta \in N$.


Theorem 1. Let $T = t(R)$ be a complete sufficient statistic for $R$, and let
$\psi$ be an unbiased estimate of $\gamma(\theta)$ based on $T$ [i.e.,
$\psi(x)$ can be expressed as a function of $t(x)$]. Assume that the loss
function $L(\gamma(\theta), a)$ is convex in $a$ for each fixed $\theta$. Then
$\psi$ has uniformly minimum risk among all unbiased estimates of
$\gamma(\theta)$.


PROOF. Let $\psi'$ be any unbiased estimate of $\gamma(\theta)$, and define
$\psi''(x) = E[\psi'(R) \mid T = t(x)]$. (Since $T$ is sufficient, $\psi''$
does not depend on $\theta$ and hence is a legitimate estimate.) $\psi''$ is an
unbiased estimate of $\gamma(\theta)$ based on $T$, and so is $\psi$, and
therefore $E_\theta[\psi''(R) - \psi(R)] = 0$ for all $\theta$. But
$\psi''(R)$ and $\psi(R)$ can be expressed as functions of $T$; hence, by
completeness,

$$P_\theta\{\psi''(R) = \psi(R)\} = 1 \quad \text{for all } \theta$$

It follows that $\rho_{\psi''}(\theta) = \rho_\psi(\theta)$ for all $\theta$.
But the proof of the Rao-Blackwell theorem, with $R_1$ replaced by $T$, $x$ by
$t(x)$, $\psi(R_1, R_2)$ by $\psi'(R)$, and $\psi^*(R_1)$ by $\psi''(R)$, shows
that $\rho_{\psi''}(\theta) \le \rho_{\psi'}(\theta)$ for all $\theta$, as
desired.


If $L(\gamma(\theta), a) = (\gamma(\theta) - a)^2$, then
$\rho_\psi(\theta) = E_\theta[(\gamma(\theta) - \psi(R))^2] = \operatorname{Var}_\theta \psi(R)$.
Thus $\psi$ has the smallest variance of all unbiased estimates of
$\gamma(\theta)$, regardless of the state of nature. In this case $\psi$ is
said to be a uniformly minimum variance unbiased estimate (UMVUE).


▶ Example 1. Let $R_1, \ldots, R_n$ be independent, each Bernoulli with
parameter $\theta$, $0 \le \theta \le 1$. By Example 1, Section 8.4,
$T = R_1 + \cdots + R_n$ is sufficient for $(R_1, \ldots, R_n)$; let us show
that it is complete.

Now $T$ is binomial with parameters $n$ and $\theta$; hence

$$E_\theta g(T) = \sum_{k=0}^n g(k)\binom{n}{k}\theta^k(1-\theta)^{n-k} = \Big[\sum_{k=0}^n g(k)\binom{n}{k}\Big(\frac{\theta}{1-\theta}\Big)^k\Big](1-\theta)^n \quad \text{if } \theta < 1$$

If $E_\theta g(T) = 0$ for all $\theta \in [0, 1]$, then
$\sum_{k=0}^n g(k)\binom{n}{k}z^k = 0$ for all $z \in [0, \infty)$; hence
$g(k) = 0$ for $k = 0, 1, \ldots, n$.

We now look for unbiased estimates of $\gamma(\theta)$ based on $T$. If
$\psi(x_1, \ldots, x_n) = g(t(x_1, \ldots, x_n))$,
$t(x_1, \ldots, x_n) = x_1 + \cdots + x_n$, is such an estimate, the above
argument shows that $E_\theta\psi(R_1, \ldots, R_n) = E_\theta g(T)$ is a
polynomial in $\theta$ of degree $\le n$. Thus $\gamma(\theta)$ must be of the
form $a_0 + a_1\theta + \cdots + a_n\theta^n$; furthermore, an unbiased estimate
of such an expression is easily found. If $T^{(r)} = T(T-1)\cdots(T-r+1)$,
$n^{(r)} = n(n-1)\cdots(n-r+1)$, then

$$E_\theta\Big[\frac{T^{(r)}}{n^{(r)}}\Big] = \frac{1}{n^{(r)}}\sum_{k=r}^n k(k-1)\cdots(k-r+1)\binom{n}{k}\theta^k(1-\theta)^{n-k}$$

$$= \theta^r\sum_{k=r}^n \frac{(n-r)!}{(k-r)!\,(n-k)!}\theta^{k-r}(1-\theta)^{n-k} = \theta^r$$

Thus $\sum_{r=0}^n a_r[T^{(r)}/n^{(r)}]$ is a UMVUE of
$\gamma(\theta) = \sum_{r=0}^n a_r\theta^r$; in particular the sample mean
$T/n$ is a UMVUE of $\theta$. ◀
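The unbiasedness of $T^{(r)}/n^{(r)}$ for $\theta^r$ can be confirmed by summing over the binomial distribution of $T$. The sketch below is illustrative; the function names and the values $n = 10$, $r = 3$, $\theta = 0.3$ are hypothetical.

```python
from math import comb

def falling(t, r):
    """Falling factorial t(t-1)...(t-r+1); equals 1 when r = 0."""
    out = 1
    for i in range(r):
        out *= t - i
    return out

def expected_estimate(theta, n, r):
    """E_theta[T^(r) / n^(r)] where T is binomial with parameters n and theta."""
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k)
               * falling(k, r) / falling(n, r)
               for k in range(n + 1))

if __name__ == "__main__":
    print(expected_estimate(0.3, 10, 3), 0.3 ** 3)
```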


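▶ Example 2. Let $R_1, \ldots, R_n$ be independent, each Poisson with parameter
$\theta$. By Example 2, Section 8.4, $T = R_1 + \cdots + R_n$ is sufficient for
$(R_1, \ldots, R_n)$; $T$ is also complete. For $T$ is Poisson with parameter
$n\theta$; hence

$$E_\theta g(T) = \sum_{k=0}^\infty g(k)\frac{e^{-n\theta}(n\theta)^k}{k!}$$

If $E_\theta g(T) = 0$ for all $\theta > 0$, then

$$\sum_{k=0}^\infty \Big[\frac{g(k)n^k}{k!}\Big]\theta^k = 0 \quad \text{for all } \theta > 0$$

Since this is a power series in $\theta$, we must have $g \equiv 0$.

If $\psi(x_1, \ldots, x_n) = g(t(x_1, \ldots, x_n))$ is an unbiased estimate of
$\gamma(\theta)$, then

$$\gamma(\theta) = E_\theta\psi(R_1, \ldots, R_n) = E_\theta g(T) = \sum_{k=0}^\infty g(k)\frac{e^{-n\theta}(n\theta)^k}{k!}$$

Thus $\gamma(\theta)$ must be expressible as a power series in $\theta$. If
$\gamma(\theta) = \sum_{k=0}^\infty a_k\theta^k$, then

$$\gamma(\theta)e^{n\theta} = \sum_{j=0}^\infty a_j\theta^j\sum_{k=0}^\infty \frac{(n\theta)^k}{k!} = \sum_{k=0}^\infty c_k\theta^k \quad \text{where } c_k = \sum_{i=0}^k \frac{n^i}{i!}a_{k-i}$$

But

$$\gamma(\theta)e^{n\theta} = \sum_{k=0}^\infty \frac{g(k)n^k\theta^k}{k!}$$

hence

$$g(k) = \frac{k!\,c_k}{n^k}$$

We conclude that

$$g(T) = \frac{T!}{n^T}\sum_{i=0}^T \frac{n^i}{i!}a_{T-i} \quad \text{is a UMVUE of} \quad \gamma(\theta) = \sum_{k=0}^\infty a_k\theta^k$$

For example, if $\gamma(\theta) = \theta^r$, $r = 1, 2, \ldots$, the UMVUE is

$$\frac{T!}{(T-r)!\,n^r} = \frac{T^{(r)}}{n^r} \quad (\text{= the sample mean when } r = 1)$$

[In this particular case the above computation could have been avoided,
since we know that $E_\theta(T^{(r)}) = (n\theta)^r$ (Problem 8, Section 3.2).
Since $T^{(r)}/n^r$ is an unbiased estimate of $\theta^r$ based on $T$, it is a
UMVUE.]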
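The general Poisson formula $g(T) = (T!/n^T)\sum_{i=0}^T (n^i/i!)\,a_{T-i}$ can be checked numerically by truncating the expectation over $T$. The sketch below is only an illustration: the function names, the truncation point, and the two coefficient sequences tried (the single coefficient $a_2 = 1$, giving $\gamma(\theta) = \theta^2$, and all $a_k = 1$, giving $\gamma(\theta) = 1/(1-\theta)$ for $0 < \theta < 1$) are choices made here, not part of the text.

```python
from math import exp, factorial

def umvue(t, n, a):
    """g(t) = (t! / n^t) * sum_{i=0}^{t} (n^i / i!) * a(t - i)."""
    return factorial(t) / n ** t * sum(n ** i / factorial(i) * a(t - i)
                                       for i in range(t + 1))

def expected_value(theta, n, a, terms=60):
    """E_theta g(T) for T Poisson with parameter n*theta, truncated after `terms` terms."""
    lam = n * theta
    return sum(umvue(k, n, a) * exp(-lam) * lam ** k / factorial(k)
               for k in range(terms))

if __name__ == "__main__":
    n, theta = 3, 0.4
    square = lambda k: 1.0 if k == 2 else 0.0  # gamma(theta) = theta^2
    geometric = lambda k: 1.0                  # gamma(theta) = 1 / (1 - theta)
    print(expected_value(theta, n, square), theta ** 2)
    print(expected_value(theta, n, geometric), 1 / (1 - theta))
```

The truncation at 60 terms is adequate for the small values used here; larger $n\theta$ would need more terms or a log-space computation.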

As another example, a UMVUE of
$1/(1 - \theta) = \sum_{k=0}^\infty \theta^k$, $0 < \theta < 1$, is

$$g(T) = \frac{T!}{n^T}\sum_{i=0}^T \frac{n^i}{i!} \quad ◀$$


PROBLEMS 


1. Find a UMVUE of e in Example 2. 


2. Let $R_1, \ldots, R_n$ be independent, each uniformly distributed between 0
and $\theta > 0$. By Problem 1, Section 8.4, $T = \max R_i$ is sufficient for
$(R_1, \ldots, R_n)$.

(a) Show that $T$ is complete.

(b) Find a UMVUE of $\gamma(\theta)$, assuming that $\gamma$ extends to a
function with a continuous derivative on $[0, \infty)$, and
$\theta^n\gamma(\theta) \to 0$ as $\theta \to 0$. [In part (a), use without
proof the fact that if $\int_0^\theta h(y)\,dy = 0$ for all $\theta > 0$, then
$h(y) = 0$ except on a set of Lebesgue measure 0. Notice that if it is known
that $h$ is continuous, then $h \equiv 0$ by the fundamental theorem of
calculus.]

3. Let $R_1, \ldots, R_n$ be independent, each normal with mean $\theta$ and
known variance $\sigma^2$.

(a) Show that the sample mean $\bar R$ is a UMVUE of $\theta$.

(b) Show that $(\bar R)^2 - (\sigma^2/n)$ is a UMVUE of $\theta^2$.

[Use without proof the fact that if
$\int_{-\infty}^{\infty} h(y)e^{\theta y}\,dy = 0$ for all $\theta > 0$, then
$h(y) = 0$ except on a set of Lebesgue measure 0.]

4. Let $R_1, \ldots, R_n$ be independent, with $P\{R_i = k\} = 1/N$,
$k = 1, \ldots, N$; take $\theta = N$, $N = 1, 2, \ldots$.

(a) Show that $\max_{1 \le i \le n} R_i$ is a complete sufficient statistic.

(b) Find a UMVUE of $\gamma(N)$.


5. Let $R$ have the negative binomial distribution:

$$P\{R = k\} = \binom{k-1}{r-1}p^r(1-p)^{k-r}, \qquad k = r, r+1, \ldots,\ 0 < p \le 1$$

Take $\theta = 1 - p$, $0 \le \theta < 1$. Show that $\gamma(\theta)$ has a
UMVUE if and only if it is expressible as a power series in $\theta$; find the
form of the UMVUE.


6. Let

$$P\{R = k\} = \frac{e^{-\theta}\theta^k}{(1 - e^{-\theta})\,k!}, \qquad k = 1, 2, \ldots,\ \theta > 0$$

(This is the conditional probability function of a Poisson random variable
$R'$, given that $R' \ge 1$.) $R$ is clearly sufficient for itself, and is
complete by an argument similar to that of Example 2.

(a) Find a UMVUE of $e^{-\theta}$.

(b) Show that (assuming quadratic loss function) the estimate $\psi$ found in
part (a) is inadmissible; that is, there is another estimate $\psi'$ such that
$\rho_{\psi'}(\theta) \le \rho_\psi(\theta)$ for all $\theta$, and
$\rho_{\psi'}(\theta) < \rho_\psi(\theta)$ for some $\theta$. This shows that
unbiased estimates, while often easy to find, are not necessarily desirable.

7. The following is another method for obtaining a UMVUE. Let
$R_1, \ldots, R_n$ be independent, each Bernoulli with parameter $\theta$,
$0 \le \theta \le 1$, as in Example 1. If $j = 1, \ldots, n$, then

$$E_\theta[R_1R_2 \cdots R_j] = P\{R_1 = \cdots = R_j = 1\} = \theta^j$$

Thus $R_1R_2 \cdots R_j$ is an unbiased estimate of $\theta^j$. But then
$\psi(k) = E[R_1 \cdots R_j \mid \sum_{i=1}^n R_i = k]$ is an unbiased estimate
of $\theta^j$ based on the complete sufficient statistic $\sum_{i=1}^n R_i$, so
that $\psi$ is a UMVUE. Compute $\psi$ directly and show that the result agrees
with Example 1.


8. Let $R_1, \ldots, R_n$ be independent, each Poisson with parameter
$\theta > 0$. Show, using the analysis in Problem 7, that

$$E\Big(R_1R_2 \,\Big|\, \sum_{i=1}^n R_i = k\Big) = \frac{k(k-1)}{n^2}, \qquad k = 0, 1, \ldots$$


9. Let $R_1, \ldots, R_n$ be independent, each uniformly distributed between 0
and $\theta$; if $T = \max R_i$, then $[(n + 1)/n]T$ is a UMVUE of $\theta$ (see
Problem 2). Compare the risk function $E_\theta[((1 + 1/n)T - \theta)^2]$ using
$[(n + 1)/n]T$ with the risk function
$E_\theta[((2/n)\sum_{i=1}^n R_i - \theta)^2]$ using the unbiased estimate
$(2/n)\sum_{i=1}^n R_i$.
10. Let $R_1, \ldots, R_n$ be independent, each Bernoulli with parameter
$\theta \in [0, 1]$. Show that (assuming quadratic loss function) there is no
best estimate of $\theta$ based on $R_1, \ldots, R_n$; that is, there is no
estimate $\psi$ such that $\rho_\psi(\theta) \le \rho_{\psi'}(\theta)$ for all
$\theta$ and all estimates $\psi'$ of $\theta$.

11. If $E_\theta\psi_1(R) = E_\theta\psi_2(R) = \gamma(\theta)$ and $\psi_1$,
$\psi_2$ both minimize
$\rho_\psi(\theta) = E_\theta[(\psi(R) - \gamma(\theta))^2]$, $\theta$ fixed,
show that $P_\theta\{\psi_1(R) = \psi_2(R)\} = 1$. Consequently, if $\psi_1$ and
$\psi_2$ are UMVUEs of $\gamma(\theta)$, then, for each $\theta$,
$\psi_1(R) = \psi_2(R)$ with probability 1.


12. Let $f_\theta$, $\theta \in N$ = an open interval of reals, be a family of
densities. Assume that $\partial f_\theta(x)/\partial\theta$ exists and is
continuous everywhere, and that $\int_{-\infty}^{\infty} f_\theta(x)\,dx$ can be
differentiated under the integral sign with respect to $\theta$.

(a) If $R$ has density $f_\theta$ when the state of nature is $\theta$, show
that

$$E_\theta\Big[\frac{\partial}{\partial\theta}\ln f_\theta(R)\Big] = 0$$

(b) If $E_\theta\psi(R) = \gamma(\theta)$ and
$\int_{-\infty}^{\infty} \psi(x)f_\theta(x)\,dx$ can be differentiated under
the integral sign with respect to $\theta$, show that

$$\frac{d}{d\theta}\gamma(\theta) = E_\theta\Big[\psi(R)\frac{\partial}{\partial\theta}\ln f_\theta(R)\Big]$$

(c) Under the assumptions of part (b), show that

$$\operatorname{Var}_\theta \psi(R) \ge \frac{[(d/d\theta)\gamma(\theta)]^2}{E_\theta[(\partial \ln f_\theta(R)/\partial\theta)^2]}$$

if the denominator is $> 0$. In particular, if
$f_\theta(x) = f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^n h_\theta(x_i)$, then

$$E_\theta\Big[\Big(\frac{\partial}{\partial\theta}\ln f_\theta(R)\Big)^2\Big] = \operatorname{Var}_\theta\Big[\frac{\partial}{\partial\theta}\ln f_\theta(R)\Big] = n\operatorname{Var}_\theta\Big[\frac{\partial}{\partial\theta}\ln h_\theta(R_i)\Big] = nE_\theta\Big[\Big(\frac{\partial}{\partial\theta}\ln h_\theta(R_i)\Big)^2\Big]$$

where $R = (R_1, \ldots, R_n)$.

The above result is called the Cramér-Rao inequality (an analogous theorem
may be proved with densities replaced by probability functions). If $\psi$ is
an estimate that satisfies the Cramér-Rao lower bound with equality for all
$\theta$, then $\psi$ is a UMVUE of $\gamma(\theta)$. This idea may be used to
give an alternative proof that the sample mean is a UMVUE of the true mean in
the Bernoulli, Poisson, and normal cases (see Examples 1 and 2 and Problem 3 of
this section).


13. If $R_1, \ldots, R_n$ are independent, each with mean $\mu$ and variance
$\sigma^2$, and $V^2$ is the sample variance, show that $V^2$ is a biased
estimate of $\sigma^2$; specifically,

$$E(V^2) = \frac{n-1}{n}\,\sigma^2$$
8.6 SAMPLING FROM A NORMAL POPULATION


If $R_1, \ldots, R_n$ are independent, each normally distributed with mean
$\mu$ and variance $\sigma^2$, we have seen that $(\bar R, V^2)$, where
$\bar R$ is the sample mean $(1/n)(R_1 + \cdots + R_n)$ and
$V^2 = (1/n)\sum_{i=1}^n (R_i - \bar R)^2$ is the sample variance, is a
sufficient statistic for $(R_1, \ldots, R_n)$. $\bar R$ and $V^2$ have some
special properties that are often useful. First, $\bar R$ is a sum of the
independent normal random variables $R_i/n$, each of which has mean $\mu/n$ and
variance $\sigma^2/n^2$; hence $\bar R$ is normal with mean $\mu$ and variance
$\sigma^2/n$. We now prove that $\bar R$ and $V^2$ are independent.


Theorem 1. If $R_1, \ldots, R_n$ are independent, each normal
$(\mu, \sigma^2)$, the associated sample mean and variance are independent
random variables.


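*PROOF. Define random variables $W_1, \ldots, W_n$ by

$$W_1 = \frac{1}{\sqrt{n}}R_1 + \cdots + \frac{1}{\sqrt{n}}R_n$$

$$W_2 = c_{21}R_1 + \cdots + c_{2n}R_n$$

$$\vdots$$

$$W_n = c_{n1}R_1 + \cdots + c_{nn}R_n$$

where the $c_{ij}$ are chosen so as to make the transformation orthogonal.
[This may be accomplished by extending the vector
$(1/\sqrt{n}, \ldots, 1/\sqrt{n})$ to an orthonormal basis for $E^n$.] The
Jacobian $J$ of the transformation is the determinant of the orthogonal matrix
$A = [c_{ij}]$ (with $c_{1j} = 1/\sqrt{n}$, $j = 1, \ldots, n$), namely,
$\pm 1$. Thus (see Problem 12, Section 2.8) the density of
$(W_1, \ldots, W_n)$ is given by

$$f^*(y_1, \ldots, y_n) = \frac{f(x_1, \ldots, x_n)}{|J|} = (2\pi\sigma^2)^{-n/2}\exp\Big[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big]$$

where

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = A\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$

Since $\sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2$ by orthogonality, and
$\sum_{i=1}^n x_i = \sqrt{n}\,y_1$,

$$f^*(y_1, \ldots, y_n) = (2\pi\sigma^2)^{-n/2}\exp\Big[-\frac{1}{2\sigma^2}\Big(\sum_{i=1}^n y_i^2 - 2\mu\sqrt{n}\,y_1 + n\mu^2\Big)\Big]$$

$$= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{(y_1 - \sqrt{n}\,\mu)^2}{2\sigma^2}\Big]\prod_{i=2}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{y_i^2}{2\sigma^2}\Big]$$

It follows that $W_1, \ldots, W_n$ are independent, with $W_2, \ldots, W_n$
each normal $(0, \sigma^2)$ and $W_1$ normal $(\sqrt{n}\,\mu, \sigma^2)$. But

$$nV^2 = \sum_{i=1}^n (R_i - \bar R)^2 = \sum_{i=1}^n R_i^2 - 2\bar R\sum_{i=1}^n R_i + n(\bar R)^2$$

$$= \sum_{i=1}^n R_i^2 - n(\bar R)^2 = \sum_{i=1}^n W_i^2 - W_1^2 = \sum_{i=2}^n W_i^2$$

Since $\sqrt{n}\,\bar R = W_1$, it follows that $\bar R$ and $V^2$ are
independent, completing the proof.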

The above argument also gives us the distribution of the sample variance.
For

$$\frac{nV^2}{\sigma^2} = \sum_{i=2}^n \Big(\frac{W_i}{\sigma}\Big)^2$$

where the $W_i/\sigma$ are independent, each normal (0, 1). Thus
$nV^2/\sigma^2$ has the chi-square distribution with $n - 1$ degrees of
freedom; that is, the density of $nV^2/\sigma^2$ is

$$\frac{1}{2^{(n-1)/2}\,\Gamma((n-1)/2)}\,x^{(n-3)/2}e^{-x/2}, \qquad x \ge 0$$

(see Problem 3, Section 5.2).
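Now since $\bar R$ is normal $(\mu, \sigma^2/n)$, $\sqrt{n}\,(\bar R - \mu)/\sigma$
is normal (0, 1); hence

$$P\Big\{-b \le \frac{\sqrt{n}\,(\bar R - \mu)}{\sigma} \le b\Big\} = F^*(b) - F^*(-b) = 2F^*(b) - 1$$

where $F^*$ is the normal (0, 1) distribution function. If $b$ is chosen so that
$2F^*(b) - 1 = 1 - \alpha$, that is, $F^*(b) = 1 - \alpha/2$
($b = N_{\alpha/2}$ in the terminology of Example 3, Section 8.2), then

$$P\Big\{-N_{\alpha/2} \le \frac{\sqrt{n}\,(\bar R - \mu)}{\sigma} \le N_{\alpha/2}\Big\} = 1 - \alpha$$

or

$$P\Big\{\bar R - \frac{\sigma N_{\alpha/2}}{\sqrt{n}} \le \mu \le \bar R + \frac{\sigma N_{\alpha/2}}{\sqrt{n}}\Big\} = 1 - \alpha$$

Thus with probability $1 - \alpha$ the true mean $\mu$ lies in the random
interval

$$I = \Big[\bar R - \frac{\sigma N_{\alpha/2}}{\sqrt{n}},\ \bar R + \frac{\sigma N_{\alpha/2}}{\sqrt{n}}\Big]$$

$I$ is called a confidence interval for $\mu$ with confidence coefficient
$1 - \alpha$.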
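For the case of known variance, the confidence interval can be computed directly with the Python standard library. The sketch below is illustrative only: the function name and the sample values are hypothetical, and the quantile $b$ with $F^*(b) = 1 - \alpha/2$ is obtained from `statistics.NormalDist().inv_cdf`.

```python
import math
from statistics import NormalDist, fmean

def normal_confidence_interval(observations, sigma, alpha):
    """Interval [R_bar - sigma*b/sqrt(n), R_bar + sigma*b/sqrt(n)], F*(b) = 1 - alpha/2."""
    n = len(observations)
    r_bar = fmean(observations)
    b = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = sigma * b / math.sqrt(n)
    return r_bar - half_width, r_bar + half_width

if __name__ == "__main__":
    data = [4.8, 5.1, 5.3, 4.9, 5.4]  # illustrative observations
    print(normal_confidence_interval(data, sigma=0.5, alpha=0.05))
```

The interval is random through the sample mean only; its width depends on $\sigma$, $n$, and $\alpha$ but not on the data.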

The interval $I$ is computable from the given observations of
$R_1, \ldots, R_n$, provided that $\sigma^2$ is known. If $\sigma^2$ is unknown,
it is natural to replace the true variance $\sigma^2$ by the sample variance
$V^2$. However, we then must know something about the random variable
$(\bar R - \mu)/V$. In order to provide the necessary information, we do the
following computation.

Let $R_1$ be normal (0, 1), and let $R_2$ have the chi-square distribution with $m$ degrees of freedom; assume that $R_1$ and $R_2$ are independent. We compute the density of $\sqrt{m}\, R_1/\sqrt{R_2}$, as follows.

Let

$$W_1 = \frac{\sqrt{m}\, R_1}{\sqrt{R_2}}, \qquad W_2 = R_2$$

so that

$$R_1 = W_1\sqrt{\frac{W_2}{m}}, \qquad R_2 = W_2$$

Thus we have a transformation of the form

$$(y_1, y_2) = g(x_1, x_2) = \left(\frac{\sqrt{m}\, x_1}{\sqrt{x_2}},\ x_2\right)$$

with inverse given by

$$(x_1, x_2) = h(y_1, y_2) = \left(y_1\sqrt{\frac{y_2}{m}},\ y_2\right)$$

$g$ is defined on $\{(x_1, x_2) \in E^2: x_2 > 0\}$ and $h$ on $\{(y_1, y_2) \in E^2: y_2 > 0\}$.
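By Problem 12, Section 2.8, the density of $(W_1, W_2)$ is given by

$$f^*_{12}(y_1, y_2) = f_{12}(h(y_1, y_2))\,|J_h(y_1, y_2)|, \quad y_2 > 0$$

where

$$J_h(y_1, y_2) = \begin{vmatrix} \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\ \partial x_2/\partial y_1 & \partial x_2/\partial y_2 \end{vmatrix} = \begin{vmatrix} \sqrt{y_2/m} & y_1/(2\sqrt{m y_2}) \\ 0 & 1 \end{vmatrix} = \sqrt{\frac{y_2}{m}}$$

Thus

$$f^*_{12}(y_1, y_2) = \frac{1}{\sqrt{2\pi}}\, e^{-y_1^2 y_2/2m} \cdot \frac{1}{2^{m/2}\,\Gamma(m/2)}\, y_2^{(m/2)-1} e^{-y_2/2} \sqrt{\frac{y_2}{m}}$$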


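Therefore the density of $W_1$ is

$$f_{W_1}(y_1) = \int_0^\infty f^*_{12}(y_1, y_2)\, dy_2$$

But

$$\int_0^\infty y_2^{(m-1)/2} e^{-(1 + y_1^2/m)\,y_2/2}\, dy_2 = \frac{\Gamma((m+1)/2)\, 2^{(m+1)/2}}{(1 + y_1^2/m)^{(m+1)/2}}$$

Hence

$$f_{W_1}(y_1) = \frac{\Gamma((m+1)/2)}{\sqrt{m\pi}\,\Gamma(m/2)} \cdot \frac{1}{(1 + y_1^2/m)^{(m+1)/2}}$$

A random variable with this density is said to have the $t$ distribution with $m$ degrees of freedom.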


An application of Stirling's formula shows that the $t$ density approaches the normal (0, 1) density as $m \to \infty$.
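The convergence of the $t$ density to the normal (0, 1) density can be observed numerically. The sketch below evaluates the $t$ density through logarithms of the gamma function (to avoid overflow for large $m$) and records the largest gap from the normal density on a grid; the grid on $[-4, 4]$ and the function names are assumptions made for illustration.

```python
from math import exp, lgamma, log, pi, sqrt

def t_density(y, m):
    """Density of the t distribution with m degrees of freedom at y."""
    log_const = lgamma((m + 1) / 2) - lgamma(m / 2) - 0.5 * log(m * pi)
    return exp(log_const - (m + 1) / 2 * log(1 + y * y / m))

def normal_density(y):
    """Density of the normal (0, 1) distribution at y."""
    return exp(-y * y / 2) / sqrt(2 * pi)

# Largest absolute difference on the grid -4.0, -3.9, ..., 4.0.
grid = [k / 10 for k in range(-40, 41)]
gaps = {
    m: max(abs(t_density(y, m) - normal_density(y)) for y in grid)
    for m in (1, 10, 100, 1000)
}
```

For $m = 1$ the $t$ density reduces to the Cauchy density $1/\pi(1 + y^2)$, and the recorded gap shrinks steadily as $m$ grows.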


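Now we know that $\sqrt{n}(\bar{R} - \mu)/\sigma$ is normal (0, 1), and $nV^2/\sigma^2$ has the chi-square distribution with $n - 1$ degrees of freedom. Thus

$$\frac{\sqrt{n-1}\,\sqrt{n}(\bar{R} - \mu)/\sigma}{\sqrt{n}\, V/\sigma} = \frac{\sqrt{n-1}\,(\bar{R} - \mu)}{V} = \frac{\bar{R} - \mu}{\left[\dfrac{1}{n(n-1)}\displaystyle\sum_{i=1}^n (R_i - \bar{R})^2\right]^{1/2}}$$

has the $t$ distribution with $n - 1$ degrees of freedom.

If $t_{\beta, m}$ is such that $\int_{t_{\beta, m}}^\infty h_m(t)\, dt = \beta$, where $h_m$ is the $t$ density with $m$ degrees of freedom, then

$$P\left\{-t_{\alpha/2,\, n-1} \le \frac{\sqrt{n-1}\,(\bar{R} - \mu)}{V} \le t_{\alpha/2,\, n-1}\right\} = 1 - \alpha$$

Thus

$$\left[\bar{R} - \frac{V\, t_{\alpha/2,\, n-1}}{\sqrt{n-1}},\ \ \bar{R} + \frac{V\, t_{\alpha/2,\, n-1}}{\sqrt{n-1}}\right]$$

is a confidence interval for $\mu$ with confidence coefficient $1 - \alpha$.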
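The interval for unknown variance can be sketched in code as well. Since the Python standard library has no $t$ quantile function, the critical value $t_{\alpha/2,\, n-1}$ is passed in by the caller (here 2.776, the usual tabulated value of $t_{.025,\,4}$); the function name and the data are assumptions for the example. Note that $V$ uses the divisor $n$, so $V/\sqrt{n-1}$ coincides with the more common $s/\sqrt{n}$, where $s^2$ has divisor $n - 1$.

```python
from math import sqrt
from statistics import fmean, stdev

def t_confidence_interval(sample, t_crit):
    """Interval mean -/+ V * t_crit / sqrt(n - 1), with V^2 = (1/n) sum (R_i - mean)^2.

    t_crit is the critical value t_{alpha/2, n-1}, supplied by the caller.
    """
    n = len(sample)
    mean = fmean(sample)
    v = sqrt(sum((r - mean) ** 2 for r in sample) / n)  # sample s.d., divisor n
    half_width = v * t_crit / sqrt(n - 1)
    return mean - half_width, mean + half_width

# Hypothetical observations; 2.776 is the tabulated t_{.025, 4}.
sample = [4.8, 5.1, 5.3, 4.9, 5.4]
low, high = t_confidence_interval(sample, t_crit=2.776)
```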


PROBLEMS

1. Let $R_1$ and $R_2$ be independent, chi-square random variables with $m$ and $n$ degrees of freedom, respectively. Show that $(R_1/m)/(R_2/n)$ has density

$$f_{mn}(x) = \frac{(m/n)^{m/2}\, x^{(m/2)-1}}{\beta(m/2,\, n/2)\,(1 + mx/n)^{(m+n)/2}}, \quad x > 0$$

$(R_1/m)/(R_2/n)$ is said to have the $F$ distribution with $m$ and $n$ degrees of freedom, abbreviated $F(m, n)$.

2. Calculate the mean and variance of the chi-square, $t$, and $F$ distributions.

3. (a) If $T$ has the $t$ distribution with $n$ degrees of freedom, show that $T^2$ has the $F(1, n)$ distribution.
(b) If $R$ has the $F(m, n)$ distribution, show that $1/R$ has the $F(n, m)$ distribution.
(c) If $R_1$ is chi-square ($m$) and $R_2$ is chi-square ($n$), show that $R_1 + R_2$ is chi-square ($m + n$).


4. Discuss the problem of obtaining confidence intervals for the variance $\sigma^2$ of a normally distributed random variable, assuming that
(a) The mean $\mu$ is known
(b) $\mu$ is unknown

5. (A two-sample problem) Let $R_{11}, R_{12}, \ldots, R_{1n_1}, R_{21}, R_{22}, \ldots, R_{2n_2}$ be independent, with the $R_{1j}$ normal $(\mu_1, \sigma^2)$ and the $R_{2j}$ normal $(\mu_2, \sigma^2)$ ($\mu_1$, $\mu_2$, and $\sigma^2$ unknown). Thus we are taking independent samples from two different normal populations. Show that if $\bar{R}_i$ and $V_i^2$, $i = 1, 2$, are the sample mean and variance of the two samples, and

$$k = (n_1 V_1^2 + n_2 V_2^2)^{1/2}\left[\frac{n_1 + n_2}{n_1 n_2 (n_1 + n_2 - 2)}\right]^{1/2}$$

then $[\bar{R}_1 - \bar{R}_2 - k\, t_{\alpha/2,\, n_1+n_2-2},\ \bar{R}_1 - \bar{R}_2 + k\, t_{\alpha/2,\, n_1+n_2-2}]$ is a confidence interval for $\mu_1 - \mu_2$ with confidence coefficient $1 - \alpha$.

6. In Problem 5, assume that the samples have different variances $\sigma_1^2$ and $\sigma_2^2$. Discuss the problem of obtaining confidence intervals for the ratio $\sigma_1^2/\sigma_2^2$.


7. (a) Suppose that $C(R)$ is a confidence set for $\gamma(\theta)$ with confidence coefficient $\ge 1 - \alpha$; that is,

$$P_\theta\{\gamma(\theta) \in C(R)\} \ge 1 - \alpha \quad \text{for all } \theta \in N$$

Consider the hypothesis-testing problem

$$H_0: \gamma(\theta) = k$$
$$H_1: \gamma(\theta) \ne k$$

and the following test.

$$\varphi_k(x) = 1 \ \text{ if } k \notin C(x)$$
$$\varphi_k(x) = 0 \ \text{ if } k \in C(x)$$

[Thus $C(x)$ is the acceptance region of $\varphi_k$.] Show that $\varphi_k$ is a test at level $\alpha$.

(b) Suppose that for all $k$ in the range of $\gamma$ there is a nonrandomized test $\varphi_k$ [i.e., $\varphi_k(x) = 0$ or 1 for all $x$] at level $\alpha$ for $H_0: \gamma(\theta) = k$ versus $H_1: \gamma(\theta) \ne k$. Let $C(x)$ be the set $\{k: \varphi_k(x) = 0\}$. Show that $C(R)$ is a confidence set for $\gamma(\theta)$ with confidence coefficient $\ge 1 - \alpha$.

This result allows the confidence interval examples in this section to be translated into the language of hypothesis testing.


*8.7 THE MULTIDIMENSIONAL GAUSSIAN DISTRIBUTION 


If $R'_1, \ldots, R'_n$ are independent, normally distributed random variables and we define random variables $R_1, \ldots, R_n$ by $R_i = \sum_{j=1}^n a_{ij} R'_j + b_i$, $i = 1, \ldots, n$, the $R_i$ have a distribution of considerable importance in many aspects of probability and statistics. In this section we examine the properties of this distribution and make an application to the problem of prediction.

Let $R = (R_1, \ldots, R_n)$ be a random vector. The characteristic function of $R$ (or the joint characteristic function of $R_1, \ldots, R_n$) is defined by

$$M(u_1, \ldots, u_n) = E[\exp(i(u_1 R_1 + \cdots + u_n R_n))], \quad u_1, \ldots, u_n \text{ real}$$

$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left(i\sum_{k=1}^n u_k x_k\right) dF(x_1, \ldots, x_n)$$

where $F$ is the distribution function of $R$. It will be convenient to use a vector-matrix notation. If $u = (u_1, \ldots, u_n) \in E^n$, $u$ will denote the column vector with components $u_1, \ldots, u_n$. Similarly we write $x$ for col $(x_1, \ldots, x_n)$ and $R$ for col $(R_1, \ldots, R_n)$. A superscript $t$ will indicate the transpose of a matrix.


Just as in one dimension, it can be shown that the characteristic function 
determines the distribution function uniquely. 


DEFINITION. The random vector $R = (R_1, \ldots, R_n)$ is said to be Gaussian (or $R_1, \ldots, R_n$ are said to be jointly Gaussian) iff the characteristic function of $R$ is

$$M(u_1, \ldots, u_n) = \exp[iu^t b]\exp[-\tfrac{1}{2}u^t K u] = \exp\left[i\sum_{r=1}^n u_r b_r - \frac{1}{2}\sum_{r,s=1}^n u_r K_{rs} u_s\right] \qquad (8.7.1)$$

where $b_1, \ldots, b_n$ are arbitrary real numbers and $K$ is an arbitrary real symmetric nonnegative definite $n$ by $n$ matrix. (Nonnegative definite means that $\sum_{r,s=1}^n a_r K_{rs} a_s$ is real and $\ge 0$ for all real numbers $a_1, \ldots, a_n$.)

We must show that there is a random vector with this characteristic function. We shall do this in the proof of the next theorem.


Theorem 1. Let $R$ be a random $n$-vector. $R$ is Gaussian iff $R$ can be expressed as $WR' + b$, where $b = (b_1, \ldots, b_n) \in E^n$, $W$ is an $n$ by $n$ matrix, and $R'_1, \ldots, R'_n$ are independent normal random variables with 0 mean. The matrix $K$ of (8.7.1) is given by $WDW^t$, where $D = \text{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix with entries $\lambda_j = \text{Var}\, R'_j$, $j = 1, \ldots, n$. (To avoid having to treat the case $\lambda_j = 0$ separately, we agree that normal with expectation $m$ and variance 0 will mean degenerate at $m$.)
Furthermore, the matrix $W$ can be taken as orthogonal.


PROOF. If $R = WR' + b$, then

$$E[\exp(iu^t R)] = \exp[iu^t b]\, E[\exp(iu^t W R')]$$

But

$$E[\exp(iv^t R')] = E\left[\exp\left(i\sum_{k=1}^n v_k R'_k\right)\right] = \prod_{k=1}^n E[\exp(iv_k R'_k)] = \exp\left[-\frac{1}{2}\sum_{k=1}^n \lambda_k v_k^2\right]$$

Set $v = W^t u$ (so that $u^t W R' = v^t R'$) to obtain

$$E[\exp(iu^t R)] = \exp[iu^t b - \tfrac{1}{2}u^t K u]$$

where $K = WDW^t$. $K$ is clearly symmetric, and is also nonnegative definite, since $u^t K u = v^t D v = \sum_{k=1}^n \lambda_k v_k^2 \ge 0$, where $v = W^t u$. Thus $R$ is Gaussian. (Notice also that if $K$ is symmetric and nonnegative definite, there is an orthogonal matrix $W$ such that $W^t K W = D$, where $D$ is the diagonal matrix of eigenvalues of $K$. Thus $K = WDW^t$, so that it is always possible to construct a Gaussian random vector corresponding to a prescribed $K$ and $b$.)

Conversely, let $R$ have characteristic function $\exp[iu^t b - (1/2)(u^t K u)]$, where $K$ is symmetric and nonnegative definite. Let $W$ be an orthogonal matrix such that $W^t K W = D = \text{diag}(\lambda_1, \ldots, \lambda_n)$, where the $\lambda_j$ are the eigenvalues of $K$. Let $R' = W^t(R - b)$. Then

$$E[\exp(iu^t R')] = \exp(-iu^t W^t b)\, E[\exp(iu^t W^t R)]$$

$$= \exp[-\tfrac{1}{2}v^t K v] \quad \text{where } v = Wu$$

$$= \exp[-\tfrac{1}{2}u^t D u] = \exp\left[-\frac{1}{2}\sum_{k=1}^n \lambda_k u_k^2\right]$$

It follows that $R'_1, \ldots, R'_n$ are independent, with $R'_j$ normal $(0, \lambda_j)$. Since $W$ is orthogonal, $W^t = W^{-1}$; hence $R = WR' + b$.
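The representation $R = WR' + b$ with $K = WDW^t$ can be illustrated in two dimensions. In the sketch below, $W$ is a rotation through an angle $\theta$ (an orthogonal matrix), and the values of $\theta$, $\lambda_1$, $\lambda_2$, and $b$ are arbitrary assumptions chosen for the example; the empirical covariance of simulated vectors is compared with $WDW^t$.

```python
import math
import random

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

theta = math.pi / 6                      # assumed rotation angle
W = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]  # orthogonal matrix
lambdas = (3.0, 1.0)                     # assumed variances of R'_1, R'_2
D = [[lambdas[0], 0.0], [0.0, lambdas[1]]]
b = [1.0, -2.0]                          # assumed mean vector
K = matmul(matmul(W, D), transpose(W))   # K = W D W^t

def sample_R(rng):
    """Draw one Gaussian vector R = W R' + b with independent normal R'_j."""
    r_prime = [rng.gauss(0.0, math.sqrt(lam)) for lam in lambdas]
    return [sum(W[i][j] * r_prime[j] for j in range(2)) + b[i] for i in range(2)]

rng = random.Random(0)
draws = [sample_R(rng) for _ in range(50000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(2)]
emp_K = [[sum((d[i] - means[i]) * (d[j] - means[j]) for d in draws) / len(draws)
          for j in range(2)] for i in range(2)]
```

Here $K_{11} = 3\cos^2\theta + \sin^2\theta = 2.5$, $K_{22} = 1.5$, and $K_{12} = 2\sin\theta\cos\theta = \sqrt{3}/2$, and the empirical matrix should be close to these values.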


The matrix K has probabilistic significance, as follows. 


Theorem 2. In Theorem 1 we have $E(R) = b$, that is, $E(R_j) = b_j$, $j = 1, \ldots, n$, and $K$ is the covariance matrix of the $R_j$, that is, $K_{rs} = \text{Cov}(R_r, R_s)$, $r, s = 1, \ldots, n$.

PROOF. Since the $R'_j$ have finite second moments, so do the $R_j$. $E(R) = b$ follows immediately by linearity of the expectation. Now the covariance matrix of the $R_j$ is

$$[\text{Cov}(R_r, R_s)] = [E((R_r - b_r)(R_s - b_s))] = E[(R - b)(R - b)^t]$$

where $E(A)$ for a matrix $A$ means the matrix $[E(A_{rs})]$. Thus the covariance matrix is

$$E[WR'(WR')^t] = W E(R'R'^t) W^t = WDW^t = K$$

since $D$ is the covariance matrix of the $R'_j$.


The representation of Theorem 1 yields many useful properties of Gaussian 
vectors. 


Theorem 3. Let $R$ be Gaussian with representation $R = WR' + b$, $W$ orthogonal, as in Theorem 1.

1. If $K$ is nonsingular, then the random variables $R^*_j = R_j - b_j$ are linearly independent; that is, if $\sum_{j=1}^n a_j R^*_j = 0$ with probability 1, then all $a_j = 0$. In this case $R$ has a density given by

$$f(x) = (2\pi)^{-n/2}(\det K)^{-1/2}\exp[-\tfrac{1}{2}(x - b)^t K^{-1}(x - b)]$$

2. If $K$ is singular, the $R^*_j$ are linearly dependent. If, say, $\{R^*_1, \ldots, R^*_r\}$ is a maximal linearly independent subset of $\{R^*_1, \ldots, R^*_n\}$, then $(R_1, \ldots, R_r)$ has a density of the above form, with $K$ replaced by $K_r$ = the first $r$ rows and columns of $K$. $R^*_{r+1}, \ldots, R^*_n$ can be expressed (with probability 1) as linear combinations of $R^*_1, \ldots, R^*_r$.

PROOF.

1. If $K$ is nonsingular, all $\lambda_j$ are $> 0$; hence $R'$ has density

$$f'(y) = (2\pi)^{-n/2}(\lambda_1 \cdots \lambda_n)^{-1/2}\exp\left[-\frac{1}{2}\sum_{k=1}^n \frac{y_k^2}{\lambda_k}\right] = (2\pi)^{-n/2}(\det K)^{-1/2}\exp[-\tfrac{1}{2}y^t D^{-1} y]$$

The Jacobian of the transformation $x = Wy + b$ is $\det W = \pm 1$; hence $R$ has density

$$f(x) = f'(W^t(x - b)) = (2\pi)^{-n/2}(\det K)^{-1/2}\exp[-\tfrac{1}{2}(x - b)^t W D^{-1} W^t (x - b)]$$

Since $K = WDW^t$, we have $K^{-1} = WD^{-1}W^t$, which yields the desired expression for the density.

Now if $\sum_{j=1}^n a_j R^*_j = 0$ with probability 1,

$$0 = E\left[\left|\sum_{j=1}^n a_j R^*_j\right|^2\right] = \sum_{r,s=1}^n a_r E(R^*_r R^*_s) a_s = \sum_{r,s=1}^n a_r K_{rs} a_s$$

Since $K$ is nonsingular, it is positive rather than merely nonnegative definite, and thus all $a_r = 0$.

2. If $K$ is singular, then $\sum_{r,s=1}^n a_r K_{rs} a_s$ will be 0 for some $a_1, \ldots, a_n$, not all 0. (This follows since $u^t K u = \sum_{k=1}^n \lambda_k v_k^2$, where $v = W^t u$; if $K$ is singular, then some $\lambda_k$ is 0.) But by the analysis of case 1, $E[|\sum_{r=1}^n a_r R^*_r|^2] = 0$; hence $\sum_{r=1}^n a_r R^*_r = 0$ with probability 1, proving linear dependence. The remaining statements of 2 follow from 1.


REMARK. The result that $K$ is singular iff the $R^*_j$ are linearly dependent is true for arbitrary random variables with finite second moments, as the above argument shows.

▶ Example 1. Let $(R_1, R_2)$ be Gaussian. Then

$$K = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$$

where $\sigma_1^2 = \text{Var}\, R_1$, $\sigma_2^2 = \text{Var}\, R_2$, $\sigma_{12} = \text{Cov}(R_1, R_2)$. Also, $\det K = \sigma_1^2\sigma_2^2(1 - \rho_{12}^2)$, where $\rho_{12}$ is the correlation coefficient between $R_1$ and $R_2$. Thus $K$ is singular iff $|\rho_{12}| = 1$. In the nonsingular case we have

$$K^{-1} = (\sigma_1^2\sigma_2^2(1 - \rho_{12}^2))^{-1}\begin{bmatrix} \sigma_2^2 & -\sigma_{12} \\ -\sigma_{12} & \sigma_1^2 \end{bmatrix}$$

and the density is given by

$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2(1 - \rho_{12}^2)^{1/2}} \exp\left[-\frac{\sigma_2^2(x - a)^2 - 2\sigma_{12}(x - a)(y - b) + \sigma_1^2(y - b)^2}{2\sigma_1^2\sigma_2^2(1 - \rho_{12}^2)}\right]$$

where $a = E(R_1)$, $b = E(R_2)$. The characteristic function of $(R_1, R_2)$ is

$$M(u_1, u_2) = \exp[i(au_1 + bu_2)]\exp[-\tfrac{1}{2}(\sigma_1^2 u_1^2 + 2\sigma_{12}u_1 u_2 + \sigma_2^2 u_2^2)]$$

Notice that if $n = 1$, the multidimensional Gaussian distribution reduces to the ordinary Gaussian distribution. For in this case we have

$$K = [\sigma^2], \qquad M(u) = e^{ium}e^{-u^2\sigma^2/2}, \qquad f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x - m)^2/2\sigma^2}$$

where $m = E(R)$. ◀
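The two-dimensional density of Example 1 translates directly into code. The following is a minimal sketch; the function and argument names are assumptions for illustration, and the determinant $\sigma_1^2\sigma_2^2 - \sigma_{12}^2$ is used in place of the equivalent $\sigma_1^2\sigma_2^2(1 - \rho_{12}^2)$.

```python
from math import exp, pi, sqrt

def bivariate_gaussian_density(x, y, a, b, var1, var2, cov12):
    """Density of a nonsingular Gaussian pair with means a, b and covariance
    matrix [[var1, cov12], [cov12, var2]], evaluated at (x, y)."""
    det = var1 * var2 - cov12 ** 2      # equals var1 * var2 * (1 - rho^2)
    if det <= 0:
        raise ValueError("covariance matrix must be nonsingular")
    dx, dy = x - a, y - b
    quad = (var2 * dx * dx - 2 * cov12 * dx * dy + var1 * dy * dy) / det
    return exp(-quad / 2) / (2 * pi * sqrt(det))

def normal_density(x, m, var):
    """Ordinary one-dimensional normal density."""
    return exp(-(x - m) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# With zero covariance the joint density factors into the two marginals.
joint = bivariate_gaussian_density(0.3, -1.2, 0.0, 1.0, 4.0, 9.0, 0.0)
product = normal_density(0.3, 0.0, 4.0) * normal_density(-1.2, 1.0, 9.0)
```

When $\sigma_{12} = 0$ the joint density equals the product of the two one-dimensional normal densities, which anticipates the statement that uncorrelated jointly Gaussian random variables are independent.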


Theorem 4. If $R_1$ is a Gaussian $n$-vector and $R_2 = AR_1$, where $A$ is an $m$ by $n$ matrix, then $R_2$ is a Gaussian $m$-vector.

PROOF. Let $R_1 = WR' + b$ as in Theorem 1. Then $R_2 = AWR' + Ab$, and hence $R_2$ is Gaussian by Theorem 1.

COROLLARY.
(a) If $R_1, \ldots, R_n$ are jointly Gaussian, so are $R_1, \ldots, R_m$, $m \le n$.
(b) If $R_1, \ldots, R_n$ are jointly Gaussian, then $a_1 R_1 + \cdots + a_n R_n$ is a Gaussian random variable.

PROOF. For (a) take $A = [I \ \ 0]$, where $I$ is an $m$ by $m$ identity matrix. For (b) take $A = [a_1 \ a_2 \ \cdots \ a_n]$.


Thus we see that if $R_1, \ldots, R_n$ are jointly Gaussian, then the $R_i$ are (individually) Gaussian. The converse is not true, however. It is possible to find Gaussian random variables $R_1$, $R_2$ such that $(R_1, R_2)$ is not Gaussian, and in addition $R_1 + R_2$ is not Gaussian.

For example, let $R_1$ be normal (0, 1) and define $R_2$ as follows. Let $R_3$ be independent of $R_1$, with $P\{R_3 = 0\} = P\{R_3 = 1\} = 1/2$. If $R_3 = 0$, let $R_2 = R_1$; if $R_3 = 1$, let $R_2 = -R_1$. Then $P\{R_2 \le y\} = (1/2)P\{R_1 \le y\} + (1/2)P\{-R_1 \le y\} = P\{R_1 \le y\}$, so that $R_2$ is normal (0, 1). But if $R_3 = 0$, then $R_1 + R_2 = 2R_1$, and if $R_3 = 1$, then $R_1 + R_2 = 0$. Therefore $P\{R_1 + R_2 = 0\} = 1/2$; hence $R_1 + R_2$ is not Gaussian. By corollary (b) to Theorem 4, $(R_1, R_2)$ is not Gaussian.
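The counterexample is easy to simulate. The sketch below draws $R_1$ normal (0, 1) and an independent fair coin $R_3$, forms $R_2 = \pm R_1$ accordingly, and estimates both $P\{R_1 + R_2 = 0\}$ and $P\{R_2 \le 1\}$; the seed and sample size are arbitrary assumptions.

```python
import random

def simulate_pair(rng):
    """Return (R1, R2) with R1 normal (0, 1) and R2 = R1 or -R1 by a fair coin."""
    r1 = rng.gauss(0.0, 1.0)
    r3 = rng.randrange(2)            # R3 = 0 or 1, each with probability 1/2
    r2 = r1 if r3 == 0 else -r1
    return r1, r2

rng = random.Random(1)
pairs = [simulate_pair(rng) for _ in range(20000)]

# R1 + R2 is exactly 0 whenever R3 = 1, so it has an atom of mass 1/2 at 0.
frac_zero = sum(1 for r1, r2 in pairs if r1 + r2 == 0) / len(pairs)

# R2 alone still behaves like a normal (0, 1) variable: P{R2 <= 1} is about .841.
frac_r2_below_1 = sum(1 for _, r2 in pairs if r2 <= 1) / len(pairs)
```

A Gaussian random variable is either degenerate or has a density, so a sum that equals 0 with probability 1/2 (and is nonzero otherwise) cannot be Gaussian.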

Notice that if $R_1, \ldots, R_n$ are independent and each $R_i$ is Gaussian, then the $R_i$ are jointly Gaussian (with $K$ = the diagonal matrix of variances of the $R_i$).


Theorem 5. If $R_1, \ldots, R_n$ are jointly Gaussian and uncorrelated, that is, if $K_{ij} = 0$ for $i \ne j$, they are independent.

PROOF. Let $\sigma_j^2 = \text{Var}\, R_j$. We may assume all $\sigma_j^2 > 0$; if $\sigma_j^2 = 0$, then $R_j$ is constant with probability 1 and may be deleted. Now $K = \text{diag}(\sigma_1^2, \ldots, \sigma_n^2)$; hence $K^{-1} = \text{diag}(1/\sigma_1^2, \ldots, 1/\sigma_n^2)$, and so, by Theorem 3, $R_1, \ldots, R_n$ have a joint density given by

$$f(x_1, \ldots, x_n) = (2\pi)^{-n/2}(\sigma_1 \cdots \sigma_n)^{-1}\exp\left[-\frac{1}{2}\sum_{j=1}^n \frac{(x_j - b_j)^2}{\sigma_j^2}\right]$$

Thus $R_1, \ldots, R_n$ are independent, with $R_j$ normal $(b_j, \sigma_j^2)$.


We now consider the following prediction problem. Let $R_1, \ldots, R_{n+1}$ be jointly Gaussian. We observe $R_1 = x_1, \ldots, R_n = x_n$ and then try to predict the value of $R_{n+1}$. If the predicted value is $\psi(x_1, \ldots, x_n)$ and the actual value is $x_{n+1}$, we assume a quadratic loss $(x_{n+1} - \psi(x_1, \ldots, x_n))^2$. In other words, we are trying to minimize the mean square difference between the true value and the predicted value of $R_{n+1}$. This is simply a problem of Bayes estimation with quadratic loss function, as considered in Section 8.3; in this case $R_{n+1}$ plays the role of the state of nature and $(R_1, \ldots, R_n)$ the observable. It follows that the best estimate is

$$\psi(x_1, \ldots, x_n) = E(R_{n+1} \mid R_1 = x_1, \ldots, R_n = x_n)$$

We now show that in the jointly Gaussian case $\psi$ is a linear function of $x_1, \ldots, x_n$. Thus the optimum predictor assumes a particularly simple form.

Say $\{R_1, \ldots, R_r\}$ is a maximal linearly independent subset of $\{R_1, \ldots, R_n\}$. If $R_1, \ldots, R_r, R_{n+1}$ are linearly dependent, there is nothing to prove; if $R_1, \ldots, R_r, R_{n+1}$ are linearly independent, we may replace $R_1, \ldots, R_n$ by $R_1, \ldots, R_r$ in the problem. Thus we may as well assume $R_1, \ldots, R_{n+1}$ linearly independent. Then $(R_1, \ldots, R_{n+1})$ has a density, and the conditional density of $R_{n+1}$ given $R_1 = x_1, \ldots, R_n = x_n$ is

$$h(x_{n+1} \mid x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_{n+1})}{\int_{-\infty}^{\infty} f(x_1, \ldots, x_{n+1})\, dx_{n+1}}$$

$$= \frac{(2\pi)^{-(n+1)/2}(\det K)^{-1/2}\exp\left[-\frac{1}{2}\sum_{r,s=1}^{n+1} x_r q_{rs} x_s\right]}{(2\pi)^{-(n+1)/2}(\det K)^{-1/2}\int_{-\infty}^{\infty}\exp\left[-\frac{1}{2}\sum_{r,s=1}^{n+1} x_r q_{rs} x_s\right] dx_{n+1}}$$

where $K$ is the covariance matrix of $R_1, \ldots, R_{n+1}$ and $Q = [q_{rs}] = K^{-1}$.

Thus

$$h(x_{n+1} \mid x_1, \ldots, x_n) = \frac{1}{B(x_1, \ldots, x_n)}\exp\left[-\frac{1}{2}\left(\sum_{r,s=1}^n x_r q_{rs} x_s + x_{n+1}\sum_{s=1}^n q_{n+1,s}x_s + x_{n+1}\sum_{r=1}^n x_r q_{r,n+1} + q_{n+1,n+1}x_{n+1}^2\right)\right]$$

$$= \frac{A(x_1, \ldots, x_n)}{B(x_1, \ldots, x_n)}\exp[-(Cx_{n+1}^2 + Dx_{n+1})]$$

where

$$C = \tfrac{1}{2}q_{n+1,n+1}, \qquad D = D(x_1, \ldots, x_n) = \sum_{r=1}^n q_{n+1,r}x_r$$

Therefore the conditional density can be expressed as

$$\frac{A}{B}\exp\left[\frac{D^2}{4C}\right]\exp\left[-C\left(x_{n+1} + \frac{D}{2C}\right)^2\right]$$

Thus, given $R_1 = x_1, \ldots, R_n = x_n$, $R_{n+1}$ is normal with mean $-D/2C$ and variance $1/2C = 1/q_{n+1,n+1}$. Hence

$$E(R_{n+1} \mid R_1 = x_1, \ldots, R_n = x_n) = -\frac{1}{q_{n+1,n+1}}\sum_{r=1}^n q_{n+1,r}x_r$$
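The predictor formula can be exercised on small covariance matrices. The sketch below inverts $K$ by Gauss-Jordan elimination and returns the coefficients $-q_{n+1,r}/q_{n+1,n+1}$ together with the conditional variance $1/q_{n+1,n+1}$. It assumes, as the displayed density does, that all means are zero; the helper names and the example matrices are assumptions made for illustration.

```python
def invert(matrix):
    """Inverse of a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def predictor(K):
    """Coefficients of x_1..x_n in E(R_{n+1} | R_1..R_n) and the conditional
    variance, for zero-mean jointly Gaussian variables with covariance K."""
    last_row = invert(K)[-1]                 # q_{n+1,1}, ..., q_{n+1,n+1}
    q_last = last_row[-1]
    coeffs = [-q / q_last for q in last_row[:-1]]
    return coeffs, 1.0 / q_last

# Example: three variables with common variance 2 and common covariance 1.
coeffs, cond_var = predictor([[2, 1, 1], [1, 2, 1], [1, 1, 2]])
```

For $n = 1$ the formula reduces to the familiar regression coefficient $\sigma_{12}/\sigma_1^2$, since $-q_{21}/q_{22} = \sigma_{12}/\sigma_1^2$ for a 2 by 2 matrix.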


Tables 


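Common Density Functions and Their Properties

Type | Density | Parameters
Uniform on [a, b] | $\dfrac{1}{b - a}$, $a \le x \le b$ | $a$, $b$ real, $a < b$
Normal | $\dfrac{1}{\sqrt{2\pi}\,\sigma}e^{-(x - \mu)^2/2\sigma^2}$ | $\mu$ real, $\sigma > 0$
Gamma | $\dfrac{x^{\alpha - 1}e^{-x/\beta}}{\Gamma(\alpha)\beta^\alpha}$, $x \ge 0$ | $\alpha, \beta > 0$
Beta | $\dfrac{x^{r-1}(1 - x)^{s-1}}{\beta(r, s)}$, $0 \le x \le 1$ | $r, s > 0$
Exponential (= gamma with $\alpha = 1$, $\beta = 1/\lambda$) | $\lambda e^{-\lambda x}$, $x \ge 0$ | $\lambda > 0$
Chi-square (= gamma with $\alpha = n/2$, $\beta = 2$) | $\dfrac{1}{2^{n/2}\Gamma(n/2)}x^{(n/2)-1}e^{-x/2}$, $x \ge 0$ | $n = 1, 2, \ldots$
t | $\dfrac{\Gamma[(n + 1)/2]}{\sqrt{n\pi}\,\Gamma(n/2)} \cdot \dfrac{1}{(1 + x^2/n)^{(n+1)/2}}$ | $n = 1, 2, \ldots$
F | $\dfrac{(m/n)^{m/2}\, x^{(m/2)-1}}{\beta(m/2, n/2)\,(1 + (m/n)x)^{(m+n)/2}}$, $x \ge 0$ | $m, n = 1, 2, \ldots$
Cauchy | $\dfrac{\theta}{\pi(x^2 + \theta^2)}$ | $\theta > 0$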


Common Density Functions (continued) 


Generalized Characteristic 
Function (If Easily 
Type Mean Variance Computable) 


Uniform on a+b (b — a) 1 e784 — 98d : 
[a, b] 2 12 b —aQ 5 9 all s 


=| | = [eam 
s + 1/6)’ 
r rs 
Exponential é Res > —A 
A A? s +A? 
Chi-square 2n (2s + 1)-”/2, 
n 


t 0 if m > 1; does 


not exist ifm = 1 


2n?(m +n — 2) 
m(n — 2)*(n — 4) 
n>4; oifn =3 
or 4 


Cauchy Does not exist Does not exist elul s = iu, ureal 


Common Probability Functions and Their Properties 


Type Probability p(k) Parameters 


Discrete uniform 


ar) 


Bernoulli 

Binomial 

Poisson e AA IKI, k= 02D sc A>0 

Geometric gp, =1,2,... 0<p<li,q=1-p 


Negative binomial O<p<i,q=1-p, 


pS 1 Je ey 


(Dptg?* = Gr, )p"(-9)", 
KS loses 
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Common Probability Functions (continued) 


Generalized Characteristic 
Type Mean Variance Function 


oe f | N+1 N? —1 e §(1 — e SN) 
iscrete uniform a 12 Wd —e) 


Bernoulli p(l — p) q+pes, alls 
Binomial np(1 — p) (g + pe)", alls 


: all s 


1 1-—p pes 
Geometric ~ 5 =. lge*| <1 
Pp P 1 — ge 
r(1 — p) * es \r 
Negative binomial ~ ( 5 P) ( P = g lge| <1 
Pp 1 — ge 
Selected Values of the Standard Normal Distribution Function

$F(x) = (2\pi)^{-1/2}\int_{-\infty}^{x} e^{-t^2/2}\, dt, \qquad F(-x) = 1 - F(x)$

   x    F(x)      x    F(x)      x    F(x)      x    F(x)
  .0    .500     .9    .816    1.64   .950    2.33   .990
  .1    .540    1.0    .841    1.7    .955    2.4    .992
  .2    .579    1.1    .864    1.8    .964    2.5    .994
  .3    .618    1.2    .885    1.9    .971    2.6    .995
  .4    .655    1.28   .900    1.96   .975    2.7    .996
  .5    .691    1.3    .903    2.0    .977    2.8    .997
  .6    .726    1.4    .919    2.1    .982    2.9    .998
  .7    .758    1.5    .933    2.2    .986    3.0    .999
  .8    .788    1.6    .945    2.3    .989
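Entries of this table can be recomputed from the error function, using the identity $F(x) = \frac{1}{2}[1 + \text{erf}(x/\sqrt{2})]$. The short sketch below does this for a handful of tabulated points; the function name and the selection of points are assumptions made for the example.

```python
from math import erf, sqrt

def standard_normal_cdf(x):
    """Normal (0, 1) distribution function F(x), computed via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# A few (x, F(x)) pairs copied from the table above.
table = {0.0: 0.500, 0.5: 0.691, 1.0: 0.841, 1.28: 0.900,
         1.96: 0.975, 2.0: 0.977, 3.0: 0.999}

# Largest discrepancy between the computed and the tabulated (rounded) values.
max_err = max(abs(standard_normal_cdf(x) - value) for x, value in table.items())
```

The discrepancies are of the size expected from rounding the table to three decimal places.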
Mathematical Statistics by R. Hogg and A. Craig (Macmillan, 1965). A
more advanced work emphasizing the decision-theory point of view is 
Mathematical Statistics, A Decision Theoretic Approach by T. Ferguson 
(Academic Press, 1967). 

The student who wishes to take more advanced work in probability will 
need a course in measure theory. H. Royden’s Real Analysis (Macmillan, 
1963) is a popular text for such a course. For those with a measure theory 
background, J. Lamperti’s Probability (W. A. Benjamin, 1966) gives the 
flavor of modern probability theory in a relatively light and informal way. 
A more systematic account is given in Probability by L. Breiman (Addison- 
Wesley, 1968). 
Solutions to Problems


CHAPTER 1 


Section 1.2 


1. ANBOC = {4,4 U(BOC = {0, 1, 2, 3, 4, 5, 7}, 
(A UB) AC = {0, 1,3, 5,7}, (ANB) A[AUCY] =2 


3. W = registered Whigs 
F = those who approve of Fillmore 
E = those who favor the electoral college 


x + y + z + 100 = 550
x + y + 25 = 325

Thus z + 100 − 25 = 550 − 325 = 225, so z = 150


PROBLEM 1.2.3 
If $a < x < b$ then $x \le b - 1/n$ for some $n$, hence $x \in \bigcup_{n=1}^\infty (a, b - 1/n]$. Conversely, if $x \in \bigcup_{n=1}^\infty (a, b - 1/n]$, then $a < x \le b - 1/n$ for some $n$, hence $x \in (a, b)$. The other arguments are similar.

9. $x \in A \cap (\bigcup_i B_i)$ iff $x \in A$ and $x \in B_i$ for at least one $i$
iff $x \in A \cap B_i$ for at least one $i$


Section 1.3 
1. (a) Let $A \subset \Omega$, and let $\mathcal{F}$ consist of $\emptyset$, $\Omega$, $A$ and $A^c$.

(b) Let $A_1, A_2, \ldots$ be disjoint sets whose union is $\Omega$. Let $\mathcal{F}$ consist of all finite or countable unions of the $A_i$ (including $\emptyset$ and $\Omega$).


Section 1.4 


1. 
2s 


eo co XN N 


432 


(a) The sequence of face values may be chosen in 9 ways (the high card may be anything from an ace to a six); the suit of each card in the straight may be chosen in 4 ways. Thus the probability of a straight is $9(4^5)/\binom{52}{5}$.

(b) Select the face value to appear three times (13 choices); select the two other face values [$\binom{12}{2}$ choices]. Select three suits out of four for the face value appearing three times, and select one suit for each of the two odd cards. Thus $p = 13\binom{12}{2}\binom{4}{3}(16)/\binom{52}{5}$.

(c) Select two face values for the pairs [$\binom{13}{2}$ possibilities]; then choose two suits out of four for each of the pairs. Finally, choose the odd card from the 44 cards that remain when the two face values are removed from the deck. Thus $p = \binom{13}{2}\binom{4}{2}\binom{4}{2}(44)/\binom{52}{5}$.


. 149/432 
. (a) (52)(48) - - - (20)(16)/521° 


(b) [443)39 + 448)1/G3 


. We must have exactly one number > 8 (2 choices) and exactly 3 numbers < 8 


[@) possibilities]. Thus p = 2(2)/(2). 


-(m + D/C) 

1 = @ayage 

- BO® -— BQO 

. Let $A_i$ be the event that the ticket numbered $i$ appears at the $i$th drawing. The probability of at least one match is $P(A_1 \cup \cdots \cup A_n)$. Now $P(A_1) = (n - 1)!/n! = 1/n$ (the first ticket must have number 1, the second may be any one of $n - 1$ remaining possibilities, the third one of $n - 2$, etc.) By symmetry, $P(A_i) = 1/n$ for all $i$. Similarly, $P(A_i \cap A_j) = (n - 2)!/n! = 1/n(n - 1)$, $i < j$, $P(A_i \cap A_j \cap A_k) = (n - 3)!/n! = 1/n(n - 1)(n - 2)$, $i < j < k$, etc. By the expansion formula (1.4.5) for the probability of a union,

$$P(A_1 \cup \cdots \cup A_n) = n\left(\frac{1}{n}\right) - \binom{n}{2}\frac{(n - 2)!}{n!} + \binom{n}{3}\frac{(n - 3)!}{n!} - \cdots + (-1)^{n-1}\binom{n}{n}\frac{0!}{n!} = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + \frac{(-1)^{n-1}}{n!}$$
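For small $n$ the alternating-sum formula can be compared with a direct enumeration of all $n!$ orderings of the tickets. This is only an illustrative check; the function names are assumptions, and the brute-force count is practical only for small $n$.

```python
from itertools import permutations
from math import factorial

def match_probability_formula(n):
    """1 - 1/2! + 1/3! - ... + (-1)^(n-1)/n!"""
    return sum((-1) ** (k - 1) / factorial(k) for k in range(1, n + 1))

def match_probability_bruteforce(n):
    """Fraction of the n! orderings in which some ticket i appears at draw i."""
    perms = list(permutations(range(n)))
    hits = sum(1 for p in perms if any(p[i] == i for i in range(n)))
    return hits / len(perms)

formula_4 = match_probability_formula(4)       # 1 - 1/2 + 1/6 - 1/24
bruteforce_4 = match_probability_bruteforce(4)  # 15 of the 24 orderings match
```

As $n$ increases the probability approaches $1 - e^{-1} \approx .632$ very quickly.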

11. (365),/365" | 


12. The number of arrangements with occupancy numbers $r_1 = r_2 = 4$, $r_3 = r_4 = r_5 = 2$, $r_6 = 0$, is $\binom{14}{4}\binom{10}{4}\binom{6}{2}\binom{4}{2}\binom{2}{2} = 14!/4!\,4!\,2!\,2!\,2!\,0!$. We must select 2 boxes out of 6 to receive 4 balls, then 3 boxes of the remaining 4 to receive 2; the remaining box receives 0. This can be done in $\binom{6}{2}\binom{4}{3}\binom{1}{1} = 6!/2!\,3!\,1!$ ways. Thus the total number is $14!\,6!/(4!)^2(2!)^4\,3!$.


Section 1.5 
4. (a) This is a multinomial problem. The probability is 


30! 50 \10 te 10 / 29 \10 
10! 10! 10! \100 100) \100 


(b) This is a binomial problem. The probability is 


30\ / 30 \ / 70 \'8 
12)\100) \100 
5. The probability that there will be exactly 3 ones and no two (and 3 from {3, 4, 5, 6}) is, by the multinomial formula, $(6!/3!\,0!\,3!)(\frac{1}{6})^3(\frac{1}{6})^0(\frac{4}{6})^3$. The probability of exactly 4 ones and 1 two is $(6!/4!\,1!\,1!)(\frac{1}{6})^4(\frac{1}{6})^1(\frac{4}{6})^1$. The sum of these expressions is the desired probability.


Section 1.6 
6 

1. (7p'g® + 6p>q® + Sp%qt) | > o)pkqie-* 
x=4 

2 1 —q" — mpq"* — (ep qr 


1 — q" 


3. We have the following tree diagram: 


[Tree diagram, PROBLEM 1.6.3: each choice of coin branches into the events "n heads" and "less than n heads".]


The desired probability is 


P{unbiased coin used and n heads obtained} 
7 P{n heads obtained} 


_ (1 pyre 
(1/2 + (1/2)p” 


4. Let A = {heads}. By the theorem of total probability,

$$P(A) = \sum_{n=1}^\infty P\{I = n\}P(A \mid I = n) = \sum_{n=1}^\infty (1/2)^n e^{-n} = \frac{(1/2)e^{-1}}{1 - \frac{1}{2}e^{-1}}$$

REMARK. To formalize this problem, we may take $\Omega$ = all pairs $(n, i)$, $n = 1, 2, \ldots$, $i = 0, 1$ [where $(n, 1)$ indicates that $I = n$ and the coin comes up heads, and $(n, 0)$ indicates that $I = n$ and the coin comes up tails].

We assign $p(n, i) = (1/2)^n e^{-n}$ if $i = 1$, and $(1/2)^n(1 - e^{-n})$ if $i = 0$. Alternatively, a tree diagram can be constructed; a typical path is indicated in the diagram.


PROBLEM 1.6.4 


5. (a) ($)(29)/(28) = .36 
(b) [G)G9) + G@)1/@8) = 2(8)29)/@8) = .48 
(c) Aha = 15 
(d) 2(29)/G8) = 
7. By the theorem of total probability, if $X_n$ is the result of the toss at $t = n$, then (see diagram)

$$y_{n+1} = P\{X_{n+1} = H\} = P\{X_n = H\}P\{X_{n+1} = H \mid X_n = H\} + P\{X_n = T\}P\{X_{n+1} = H \mid X_n = T\} = \tfrac{1}{2}y_n + \tfrac{3}{4}(1 - y_n)$$

Thus $y_{n+1} + (1/4)y_n = 3/4$. The solution to the homogeneous equation $y_{n+1} + (1/4)y_n = 0$ is $y_n = A(-1/4)^n$. Since the "forcing function" 3/4 is constant, we assume as a "particular solution" $y_n = c$. Then $(5c)/4 = 3/4$, or $c = 3/5$. Thus the general solution is $y_n = A(-1/4)^n + 3/5$. Since $y_0$ is given as 1/2, we have $A = 1/2 - 3/5 = -1/10$. Thus $y_n = 3/5 - (1/10)(-1/4)^n$.
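The closed form can be checked against the recurrence $y_{n+1} = 3/4 - (1/4)y_n$ with $y_0 = 1/2$ by direct iteration. The sketch below is only a numerical sanity check, and the function names are assumptions.

```python
def y_closed(n):
    """Closed-form solution y_n = 3/5 - (1/10)(-1/4)^n."""
    return 3 / 5 - (1 / 10) * (-1 / 4) ** n

def y_iterated(n, y0=0.5):
    """Iterate y_{k+1} = 3/4 - (1/4) y_k starting from y_0."""
    y = y0
    for _ in range(n):
        y = 3 / 4 - y / 4
    return y

# Largest disagreement between the two computations for n = 0, ..., 20.
max_gap = max(abs(y_closed(n) - y_iterated(n)) for n in range(21))
```

Both computations give $y_1 = 5/8$, and $y_n$ converges to 3/5 as $n \to \infty$.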
[Diagram, PROBLEM 1.6.7: transitions from probabilities $y_n$, $1 - y_n$ at time $n$ to $y_{n+1}$, $1 - y_{n+1}$ at time $n + 1$.]


9. We have (see diagram) $P(D \mid R) = P(D \cap R)/P(R) = .2(.9)(.25)/[.2(.9)(.25) + .8(.3)(.25)] = 3/7$


P=positive 
N=negative 
R=rash 


PROBLEM 1.6.9 


CHAPTER 2 


Section 2.2 
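1. (a) $\{\omega: R(\omega) \in E^1\} = \Omega \in \mathcal{F}$, hence $E^1 \in \mathcal{C}$. If $B_1, B_2, \ldots \in \mathcal{C}$, then $\{\omega: R(\omega) \in \bigcup_{n=1}^\infty B_n\} = \bigcup_{n=1}^\infty \{\omega: R(\omega) \in B_n\} \in \mathcal{F}$ since each $B_n \in \mathcal{C}$; thus $\bigcup_{n=1}^\infty B_n \in \mathcal{C}$. If $B \in \mathcal{C}$, then $\{\omega: R(\omega) \in B^c\} = \{\omega: R(\omega) \in B\}^c \in \mathcal{F}$, hence $B^c \in \mathcal{C}$.
(b) $\mathcal{C}$ is a sigma field containing the intervals, hence is at least as large as the smallest sigma field containing the intervals. Thus all Borel sets belong to $\mathcal{C}$.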


Section 2.3 
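(b) 1. $P\{|R| \le 2\} = P\{-2 \le R \le 2\} = \int_{-2}^{2} f_R(x)\, dx = F_R(2) - F_R(-2) = 1 - e^{-2}$

2. $P\{|R| \le 2 \text{ or } R \ge 0\} = P\{R \ge -2\} = 1 - F_R(-2) = 1 - \frac{1}{2}e^{-2}$

3. $P\{|R| \le 2 \text{ and } R \le -1\} = P\{-2 \le R \le -1\} = \frac{1}{2}(e^{-1} - e^{-2})$

4. $P\{|R| + |R - 3| \le 3\} = P\{0 \le R \le 3\} = \frac{1}{2}(1 - e^{-3})$

5. $P\{R^3 - R^2 - R - 2 \le 0\} = P\{(R - 2)(R^2 + R + 1) \le 0\} = P\{R \le 2\}$ (since $R^2 + R + 1$ is always $> 0$) $= 1 - \frac{1}{2}e^{-2}$

6. $P\{e^{\sin \pi R} \ge 1\} = P\{\sin \pi R \ge 0\} = P\{0 \le R \le 1\} + P\{2 \le R \le 3\} + P\{4 \le R \le 5\} + \cdots + P\{-2 \le R \le -1\} + P\{-4 \le R \le -3\} + \cdots$

But $P\{2n - 1 \le R \le 2n\} = P\{-2n \le R \le -2n + 1\}$ since $f_R$ is an even function. Thus

$$P\{e^{\sin \pi R} \ge 1\} = \sum_{n=0}^\infty P\{2n \le R \le 2n + 1\} + \sum_{n=1}^\infty P\{2n - 1 \le R \le 2n\} = P\{R \ge 0\} = \tfrac{1}{2}$$

7. $P\{R \text{ irrational}\} = 1 - P\{R \text{ rational}\}$. The rationals are a countable set, say $\{x_1, x_2, \ldots\}$. Hence, $P\{R \text{ rational}\} = \sum_{n=1}^\infty P\{R = x_n\} = 0$, so $P\{R \text{ irrational}\} = 1$.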
2. By direct enumeration, 
P{R = 05 = p? + pig + pg? + Pg? + PP +7 
PUR = 1} = 4piq + 6p'q® + 6p*g® + 4pq" 
P{R = 2} = 3p3q? + 3p’q¢° 
Section 2.4 
Lifo<y <1, 
1 u oo 
F,(y) = P{R, < y} + PIR, 2 | -| e* dx +{ e” dx 
0 1 


ly 


=l-e%+el 
Foy) =0,y4 <0, FM=ly21 
Fy is the integral of 


d | 1/ 
La eae rs “4 O<y <i 


= 0 elsewhere 


Hence R, is absolutely continuous. 
im 


1 
fl) = ze" <Y <e 


= 0 elsewhere 
3.
foly) = 2y7,2<y <4 
=3y°",y>4 
=O,y <2 


y/2 2 
4.12 <y <4, FW) = PLR <¥} =| goo Ss 
=e 


Y 
If4<y <5, Fy) =4 
Ify 25, Foy) = 1 
P{R, = 5S} = P{R, > 2} =3, so R, is not absolutely continuous. 


7. Consider the graph of y = g(x). The horizontal line at height y will intersect 


hy(y) hay) 


PROBLEM 2.4.7 


the graph at points (say) z,,,...,%;,, with x; € I,. If we choose a sufficiently small 
interval (a, b) about y, we have (except for finitely many y, which lead to inter- 
sections at the endpoints of intervals) 
n 
F,(b) — F,(a) = P {a < Rp <b} => Pie; < R, < d;} 


j=1 

where c; = h,(a) and d; = h,(b) if jE {i,,..., i,} and A, is increasing at 2, 
c; =h,(b) and d; =h,(a) if jE {i,,..., i,$ and h, is decreasing at x, 
c; = (say) | andd, = Oif 7 {i,,..., i}. Thus 


n aj n 

F,(b) — F(a) = | fi@)4e => (FG) — A] 
jHl1eej j=l 

where F,(d;) — F,(c;) is interpreted as 0 if c; > d;. Differentiate with respect to 

b to obtain f(b) = >, filh;(b)] |h;(D)|, as desired. 


Section 2.5 


1. (a) 
(b) 
(c) $ + FpU.5) — Fp(.5) = 4 +4 — 44) =7/12 
(d) FpG) — Fr65) = 3 - 7, = 7 


, 


Colt Colt 
Section 2.6
4. (a) 23 (d) 8 
(b) 23 (e) 4 
() 5 — (f) 4 
(g) 1 — det 


In each case, the probability is ¢ (shaded area). 


xty=3 


D> 


PROBLEM 2.6.4 


Section 2.7 
1. (a) fp@) = 6x? — 40200 <a <1), 


fpy) =2y0 <y < J) 
(b) 3/8 
2. $f_1(x) = 2e^{-x} - 2e^{-2x}$, $x \ge 0$
$f_2(y) = 2e^{-2y}$, $y \ge 0$
3. 


f,(y) 


PROBLEM 2.7.3 


1 1-2” 31 
8. (a) P{R)’ + Ro < I} -| | a(u@ + y) dy dx =—_ 
=o Jy=0 480 


(b) Let A ={R, <1,R, <1}, B={R, <1, R,>13, C={R, >1,R, <3} 


PROBLEM 2.7.8 


The probability that at least one of the random variables is < 1 is 
P(AUBUC)=P(A) + PB)+ PC) =F +e4+2=8 


The probability that exactly one of the random variablesis <1 isP(B UC) = 

P(B)+ P(C) =4. If D = {exactly one random variable < 1} and E = 

{at least one random variable < 1}, then 
P(DOE) PD) 1/2 


PDI E) =p = Pay ~ 5/8 


4 

5 
1 2 

( Notice that, for example, P(B) = | { 4(x + y) dy dx =}, ec. 
=0 Jy=1 


(c) P{R, <1, Rp <1} =4 ¥ P{R, < 1P{R, < 1} = 3/8)”, hence the random 
variables are not independent. 
Section 2.8
l.fp2) =%,0<2< 1; fs(2) = 92 8/2, z>1;f;() =0,2 <0 
2. (a) fs(2) = ze*, z > 0; fa(z) = 0,2 < 0 
1 


OO =e ane > 0, fa() = 0,2 <0 


3. fO =m alle 
4. f3(2) =5.2 >1,fs@) =0,2 <1 
5. 3V3/8 

6. 2/3 

7. 8/27 

9. 2/33 


10. Let R, = arrival time of the man, R, = arrival time of the woman. Then the 
probability that they will meet is 
shaded area 


Pi\R, — Ro| < 2} = —————_ = 1 - (1 —z? =22 —- 
4 al $4 total area ( :) mA 2) 


PROBLEM 2.8.10 


(a —1)d]” . 
ee if(n —l)d<L,and0if(@—ld>L 


1 
13. Sr) = pare, r >0; fo, ieee 0 <6 <2rn. 
Section 2.9 
2. (a) n > 4600 


(b) 1 — Se 
3. (a) No.


(b). 9 


k 
4. P{R, + Ry =k} = > P{R, =i, Rg =k — 33 


~=0 


ae 


. e Lan—t Me, k—-tip_m—k+t 
Pa “lpougle 
i=0 


re ae 6 m 
= pigrtm k 2 (7) (. _ i) But 


@ > (7) (.) : (’ "). 


[We may select k positions out of m + min (oem) ways. The number of selections 
in which exactly 7 positions are chosen from the first 1 is (”)(,™,). Sum over 7 to 
obtain (a).] Thus P{R, + Ry =k} = ("t™plgr™*, k =0,1,...,n +m. 
(Intuitively, R, + R, is the number of successes in n + m Bernoulli trials, with 
probability of success p on a given trial.) Now P{R, =/j, Rj + R, =k} = 
PIR, =f, Re=k —j} = QPP IG pr ge = Gr pigs J = 0, 
Livemg lyk = Jo) Pines. Jn 
Thus P{R, =j| Ry + Rg =k} = GE) 
1 J | 4 2 (nem) ) 
function (see Problem 7 of Section 1.5). Intuitively, given that k successes have 
occurred inn + m trials, the positions for the successes may be chosen in ("+,™) 
ways. The number of such selections in which j successes occur in the first 
trials is (*)(,™,). 


the hypergeometric probability 


CHAPTER 3 


Section 3.2 


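2. (a) $E([R_1]) = \int_{-\infty}^{\infty} [x] f_1(x)\, dx = \int_0^\infty [x] e^{-x}\, dx$

$$= \int_1^2 e^{-x}\, dx + \int_2^3 2e^{-x}\, dx + \int_3^4 3e^{-x}\, dx + \cdots + \int_n^{n+1} ne^{-x}\, dx + \cdots$$

$$= e^{-1} - e^{-2} + 2(e^{-2} - e^{-3}) + 3(e^{-3} - e^{-4}) + \cdots$$

$$= e^{-1} + e^{-2} + e^{-3} + \cdots = e^{-1}(1 + e^{-1} + e^{-2} + \cdots) = \frac{e^{-1}}{1 - e^{-1}}$$

(b) $P\{R_2 = n\} = P\{n \le R_1 < n + 1\} = \int_n^{n+1} e^{-x}\, dx = e^{-n} - e^{-(n+1)}$

$$E(R_2) = \sum_{n=0}^\infty nP\{R_2 = n\} = \sum_{n=0}^\infty n[e^{-n} - e^{-(n+1)}]$$

$$= e^{-1} - e^{-2} + 2(e^{-2} - e^{-3}) + 3(e^{-3} - e^{-4}) + \cdots = \frac{e^{-1}}{1 - e^{-1}} \quad \text{as above.}$$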


3. (a) 1 (b)O0 (c) 1 
4. 1/3 


5. 2 + 30e3 


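8. $E[R(R - 1) \cdots (R - r + 1)] = \sum_{k=0}^\infty k(k - 1) \cdots (k - r + 1)\dfrac{e^{-\lambda}\lambda^k}{k!} = \lambda^r$

Set $r = 1$ to obtain $E(R) = \lambda$; set $r = 2$ to obtain $E(R^2 - R) = \lambda^2$, hence $E(R^2) = \lambda + \lambda^2$. It follows that $\text{Var}\, R = \lambda$.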


Section 3.3 
1. This is immediate from Theorem 2 of Section 2.7. 
2. $E[(R - m)^n] = 0$, $n$ odd
$= \sigma^n(n - 1)(n - 3) \cdots (5)(3)(1)$, $n$ even
Section 3.4 
1. Let $a(R_1 - ER_1) + b(R_2 - ER_2) = 0$ (with probability 1). If, say, $b \ne 0$ then we may write $R_2 - ER_2 = c(R_1 - ER_1)$. Thus $\sigma_2^2 = c^2\sigma_1^2$ and $\text{Cov}(R_1, R_2) = c\sigma_1^2$. Therefore $\rho(R_1, R_2) = \dfrac{c\sigma_1^2}{|c|\sigma_1^2}$, hence $|\rho| = 1$.
4. In (a), let R take the values 1,2,...,m, each with probability 1/n, and set 

R, = g(R), Re = A(R), where g(@) = a;, h(i) = b;. Then 


nm | ae | n | 
E(R,R2) =), - a;b;, E(R,®) =>, - a2, — E(R,”) =D, - b? 
p=1 1 i=1 it i=1 1 
In (b) let R be uniformly distributed between a and 6b, and set R, = g(R), 
R, = h(R). Then 


b g(x)h(x) e 
E(R,R;) = | & ——— de, E(R) = i} ; 


In each case the result follows from Theorem 2 of Section 3.4. 
5. This follows from the argument of property 7, Section 3.3. 


2 2 
BO sae [’ 1?) 
—a a O 


—a@ 
Section 3.5
ni 
ae ig =] = 22-5 
3. P{R, =j, Rg =k} = jik@—j Hi 
j,k =0,1,...,", 7 +k <n (see Example 1, Section 2.9). 


4. (n — 1)p — p). 

5. Let A; = gg ie i results in success and trial i + 1 in failure}. Then R, = > I 
and Ry” =S 4 ++ 2 2 taba, Ly 4,,, = andiffj >i +2, 1,, and Ly, e 
independent, with EU, 4 14) = = P(A;)P(A;) = (pq)*. Thus 

E(R}) = (n= pg +23, S (pa? 
i=1 j=i+2 
n—3 
= (n — 1)pq + 2(pq?? > @ -—2 —-i) 
i=1 
= (n — 1)pq + 2(pq)*[1 +2 +°--+ (x — 3)] 
= (n — 1)pq + (pq)*(n — 3)(n — 2). 


Therefore, Var Ry = E(Ro”) — [E(Ry) = ( — pq + (n — 2) — 3)(—pq)? — 
(n — 1)®(pq)*, g = 1 — p (assuming n > 2). 
6. 50(49/50)19 


Section 3.6 

1. (a) .532 
(b) —2.84 

Section 3.7 


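1. $m = \int_0^\infty xe^{-x}\, dx = 1$, $E(R^2) = \int_0^\infty x^2 e^{-x}\, dx = 2$, hence $\sigma^2 = 1$. $P\{|R - m| \ge k\sigma\} = P\{|R - 1| \ge k\}$. If $0 < k < 1$, this is $\int_0^{1-k} e^{-x}\, dx + \int_{1+k}^\infty e^{-x}\, dx = 1 - e^{-(1-k)} + e^{-(1+k)}$. When $k \ge 1$, it becomes $\int_{1+k}^\infty e^{-x}\, dx = e^{-(1+k)}$. Notice that the Chebyshev bound is vacuous when $k \le 1$, and for $k > 1$, $e^{-(1+k)}$ approaches zero much more rapidly than $1/k^2$.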
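The comparison between the exact tail probability and the Chebyshev bound $1/k^2$ can be tabulated directly from the two expressions just derived. The sketch below is illustrative only; the function names and the chosen values of $k$ are assumptions.

```python
from math import exp

def exact_tail(k):
    """P{|R - 1| >= k} for R exponential with mean 1 (and variance 1)."""
    if k < 1:
        return 1 - exp(-(1 - k)) + exp(-(1 + k))
    return exp(-(1 + k))

def chebyshev_bound(k):
    """Chebyshev's upper bound 1/k^2 on P{|R - m| >= k*sigma}."""
    return 1 / k ** 2

# (k, exact probability, Chebyshev bound) for a few values of k.
comparison = [(k, exact_tail(k), chebyshev_bound(k)) for k in (1, 2, 3, 5)]
```

For $k = 2$ the exact probability is $e^{-3} \approx .05$, while the Chebyshev bound is .25.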


CHAPTER 4 
Section 4.2 
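1. $\Gamma(r) = \int_0^\infty t^{r-1}e^{-t}\,dt$ = (with $t = x^2$) $2\int_0^\infty x^{2r-1}e^{-x^2}\,dx$.
Thus $\Gamma(r)\Gamma(s) = 4\int_0^\infty\int_0^\infty x^{2r-1}y^{2s-1}e^{-(x^2+y^2)}\,dx\,dy$
= (in polar coordinates) $4\int_0^{\pi/2} d\theta \int_0^\infty (\cos\theta)^{2r-1}(\sin\theta)^{2s-1}e^{-\rho^2}\rho^{2r+2s-1}\,d\rho$.
Now $\int_0^\infty \rho^{2r+2s-1}e^{-\rho^2}\,d\rho$ = (set $u = \rho^2$) $\frac12\int_0^\infty u^{r+s-1}e^{-u}\,du = \frac12\Gamma(r+s)$.

Therefore

$\frac{\Gamma(r)\Gamma(s)}{2\Gamma(r+s)} = \int_0^{\pi/2}(\cos\theta)^{2r-1}(\sin\theta)^{2s-1}\,d\theta$

Let $z = \cos^2\theta$ so that $1 - z = \sin^2\theta$, $dz = -2\cos\theta\sin\theta\,d\theta$,

$d\theta = -\frac{dz}{2z^{1/2}(1-z)^{1/2}}$

Thus

$\frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)} = \int_0^1 z^{r-1}(1-z)^{s-1}\,dz = \beta(r, s)$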
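The gamma/beta identity derived above can be spot-checked numerically. The sketch below is only illustrative: it compares the ratio of gamma functions (via the standard library `math.gamma`) with a midpoint-rule approximation of the beta integral; the function names and the step count are arbitrary choices.

```python
import math


def beta_via_gamma(r, s):
    """Gamma(r) * Gamma(s) / Gamma(r + s)."""
    return math.gamma(r) * math.gamma(s) / math.gamma(r + s)


def beta_via_integral(r, s, steps=200_000):
    """Midpoint rule for the integral of z**(r-1) * (1-z)**(s-1) over (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        total += z ** (r - 1) * (1.0 - z) ** (s - 1)
    return total * h
```

The midpoint rule avoids evaluating the integrand at the endpoints, which matters when $r < 1$ or $s < 1$ (although convergence is then slow).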
3. py(e® — e) + pple — 8) + pale — €) + pall — €*) + pg — 


5. $f_2(y) = \frac12$, $0 < y < 1$
$= \frac{1}{2y^2}$, $y > 1$
Section 4.3 


1. $h(y \mid x) = e^{x-y}$, $0 \le x \le y$, and 0 elsewhere; $P\{R_2 \le y \mid R_1 = x\} = 1 - e^{x-y}$,
$y \ge x$, and 0 elsewhere.


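2. $f_1(x) = \int_{-1}^{x} kx\,dy = kx(x+1)$, $0 \le x \le 1$

$= \int_{-1}^{x} -kx\,dy = -kx(x+1)$, $-1 \le x \le 0$

$f_2(y) = \int_y^1 kx\,dx = \frac12 k(1 - y^2)$, $0 \le y \le 1$

$= \int_y^0 -kx\,dx + \int_0^1 kx\,dx = \frac12 k(1 + y^2)$, $-1 \le y \le 0$

Since $\int_{-1}^1 f_1(x)\,dx = \int_{-1}^1 f_2(y)\,dy = k$, we must have $k = 1$.
The conditional density of $R_2$ given $R_1$ is

$h_2(y \mid x) = \frac{f(x,y)}{f_1(x)} = \frac{1}{1+x}$, $-1 \le x \le 1$, $-1 \le y \le x$

The conditional density of $R_1$ given $R_2$ is

$h_1(x \mid y) = \frac{f(x,y)}{f_2(y)} = \frac{2x}{1-y^2}$, $0 \le y \le 1$, $y \le x \le 1$
$= \frac{2x}{1+y^2}$, $-1 \le y \le 0$, $0 \le x \le 1$
$= \frac{-2x}{1+y^2}$, $-1 \le y \le 0$, $y \le x \le 0$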


4. The conditional density of R, given R, = xis h(z| xv) =e *,z > 0,4 > 0, 


Pili < R, <2|R, =x} =et-e7,x4>0 
5. $P\{g(R_1, R_2) \le z \mid R_1 = x\} = P\{g(x, R_2) \le z \mid R_1 = x\} = \int_{\{y:\, g(x,y) \le z\}} h(y \mid x)\,dy$


Section 4.4


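1. (a) $h_2(y \mid x) = \frac{f(x,y)}{f_1(x)} = \frac{8xy}{4x^3} = \frac{2y}{x^2}$, $h_1(x \mid y) = \frac{f(x,y)}{f_2(y)} = \frac{8xy}{4y(1-y^2)} = \frac{2x}{1-y^2}$

Thus $E(R_2 \mid R_1 = x) = \int_{-\infty}^\infty y\,h_2(y \mid x)\,dy = \int_0^x y\left(\frac{2y}{x^2}\right)dy = \frac23 x$, $0 \le x \le 1$

$E(R_1 \mid R_2 = y) = \int_{-\infty}^\infty x\,h_1(x \mid y)\,dx = \int_y^1 x\,\frac{2x}{1-y^2}\,dx = \frac23\,\frac{1-y^3}{1-y^2}$, $0 \le y < 1$

(b) $E(R_2^4 \mid R_1 = x) = \int_{-\infty}^\infty y^4 h_2(y \mid x)\,dy = \int_0^x y^4\left(\frac{2y}{x^2}\right)dy = \frac13 x^4$

(c) The conditional density of $(R_1, R_2)$ given $A$ is

$f(x, y \mid A) = \frac{f(x,y)}{P(A)} = \frac{8xy}{1/16} = 128xy$, $0 \le y \le x \le \frac12$

The conditional density of $R_2$ given $A$ is

$f_2(y \mid A) = \int_{-\infty}^\infty f(x, y \mid A)\,dx = \int_y^{1/2} 128xy\,dx = 16y - 64y^3$, $0 \le y \le \frac12$
$= 0$ elsewhere

$E(R_2 \mid A) = \int_{-\infty}^\infty y f_2(y \mid A)\,dy = \int_0^{1/2} y(16y - 64y^3)\,dy = 4/15$

Alternatively,

$E(R_2 \mid A) = \frac{E(R_2 I_A)}{P(A)} = \frac{\int_0^{1/2}\int_0^x y(8xy)\,dy\,dx}{\int_0^{1/2}\int_0^x 8xy\,dy\,dx} = 4/15$

Alternatively, $E(R_2 \mid A) = \int_{-\infty}^\infty f_1(x \mid A)E(R_2 \mid R_1 = x)\,dx$, where $f_1(x \mid A) = f_1(x)/P(A)$ if $0 \le x \le \frac12$,
and 0 elsewhere. Thus

$E(R_2 \mid A) = \int_0^{1/2} \frac{4x^3}{1/16}\cdot\frac23 x\,dx = 4/15$

5,


n! > 1+
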

2 Ags 


+ Ly 
. E(R, | Ry = x) = x/2, 0 ≤ x ≤ 1

= 1/2, 1 ≤ x ≤ 2

= (x − 1)/2, 2 ≤ x ≤ 3


4. $(n —k) 
. (a) 3/7 
(b) 19/21 
. E(R) = 1P{R = 1} + 2P{R = 2} + 3P{R = 3} 
Meee 9) 30! 1 (50) (30) (20 
07107 \10 
MR= l= aap ee a (0) , 


P{R = 2} = 2€9)(.3)"¢.7)8 
P{R = 3} =1 — P{R = 1} — P{R = 2} 


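8. (a) $P\{R_3 \le z\} = P\{R_3 \le z, R_1^2 + R_2^2 \le 1\} + P\{R_3 \le z, R_1^2 + R_2^2 > 1\}$. If
$0 \le z \le 1$, then $R_1^2 + R_2^2 > 1$ implies $R_3 = 2 > z$, so $P\{R_3 \le z, R_1^2 + R_2^2 > 1\} = 0$. Thus

$F_3(z) = P\{R_1 \le z, R_1^2 + R_2^2 \le 1\}$

$= \iint_{x^2+y^2 \le 1,\ x \le z} f(x,y)\,dx\,dy = \int_0^z dx\int_0^{(1-x^2)^{1/2}} dy = \int_0^z (1-x^2)^{1/2}\,dx$

$= \frac12[z(1-z^2)^{1/2} + \arcsin z]$, $0 \le z \le 1$

If $1 \le z < 2$, $P\{R_3 \le z, R_1^2 + R_2^2 > 1\}$ is still 0, and $F_3(z) = P\{R_1 \le z, R_1^2 + R_2^2 \le 1\} = \pi/4$.

[Figure: graph of $F_3(z)$; PROBLEM 4.4.8]

By (4.4.6), $E(R_3) = 2P\{R_3 = 2\} + \int_0^1 z(1-z^2)^{1/2}\,dz = 2\left(1 - \frac{\pi}{4}\right) + \frac13$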
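(b) $\iint g(x,y)f(x,y)\,dx\,dy = \iint_{x^2+y^2 \le 1} x f_1(x)f_2(y)\,dx\,dy + \iint_{x^2+y^2 > 1} 2f_1(x)f_2(y)\,dx\,dy$

$= \int_0^1 x\,dx\int_0^{(1-x^2)^{1/2}} dy + 2\iint_{0 \le x \le 1,\ 0 \le y \le 1,\ x^2+y^2 > 1} dx\,dy$

$= \int_0^1 x(1-x^2)^{1/2}\,dx + 2\left(1 - \frac{\pi}{4}\right)$

$= \frac13 + 2\left(1 - \frac{\pi}{4}\right)$ as before.

(c) Since $R_3 = 2$ when $R_1^2 + R_2^2 > 1$, $E(R_3 \mid R_1^2 + R_2^2 > 1) = 2$. The conditional
density of $(R_1, R_2)$ given $A = \{R_1^2 + R_2^2 \le 1\}$ is

$f(x, y \mid A) = \frac{f(x,y)}{P(A)} = \frac{4}{\pi}$, $x, y \ge 0$, $x^2 + y^2 \le 1$
$= 0$ elsewhere

$E(R_3 \mid A) = \frac{E(R_3 I_A)}{P(A)} = \frac{E(R_1 I_A)}{P(A)}$ since $R_1^2 + R_2^2 \le 1$ implies $R_3 = R_1$

$= E(R_1 \mid A) = \iint x f(x, y \mid A)\,dx\,dy = \int_{-\infty}^\infty x f_1(x \mid A)\,dx$

where

$f_1(x \mid A) = \int_{-\infty}^\infty f(x, y \mid A)\,dy = \frac{4}{\pi}(1-x^2)^{1/2}$, $0 \le x \le 1$
$= 0$ elsewhere

Thus $E(R_1 \mid A) = \int_0^1 \frac{4}{\pi}x(1-x^2)^{1/2}\,dx = \frac{4}{3\pi}$

Alternatively,

$\frac{E(R_1 I_A)}{P(A)} = \frac{\iint x I_{\{x^2+y^2 \le 1\}}\,dx\,dy}{\pi/4} = \frac{\int_0^1 x\,dx\int_0^{(1-x^2)^{1/2}} dy}{\pi/4}$

Now $E(R_3) = P(A)E(R_3 \mid A) + P(A^c)E(R_3 \mid A^c)$

$= \frac{\pi}{4}\cdot\frac{4}{3\pi} + \left(1 - \frac{\pi}{4}\right)2 = \frac13 + 2\left(1 - \frac{\pi}{4}\right)$ as before.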
10. Let $R_3 = R_1 + R_2$, $R_4 = R_1 - R_2$. The joint density of $R_3$ and $R_4$ is

$f_{34}(u, v) = f_{12}(x, y)\left|\frac{\partial(x, y)}{\partial(u, v)}\right|$

where $u = x + y$, $v = x - y$, $x = \frac12(u + v)$, $y = \frac12(u - v)$ (see Section 2.8,
Problem 12). Thus $f_{34}(u, v) = \frac12$, $0 \le x \le 1$, $0 \le y \le 1$. But $0 \le x \le 1$,
$0 \le y \le 1$ corresponds to $(u, v) \in D$ (see diagram).

[Figure: the region $D$ in the $(u, v)$ plane; PROBLEM 4.4.10]

The density of $R_4$ is $f_4(v) = \int_{-\infty}^\infty f_{34}(u, v)\,du = 1 - |v|$, $|v| \le 1$
$= 0$ elsewhere

Therefore $h_3(u \mid v) = \frac{f_{34}(u, v)}{f_4(v)} = \frac{1}{2(1-v)}$, $0 \le v < 1$, $v \le u \le 2 - v$
$= \frac{1}{2(1+v)}$, $-1 < v \le 0$, $-v \le u \le 2 + v$

If $0 \le v < 1$, $E(R_3^2 \mid R_4 = v) = \frac{1}{2(1-v)}\int_v^{2-v} u^2\,du = \frac{4 - 2v + v^2}{3}$

If $-1 < v \le 0$, $E(R_3^2 \mid R_4 = v) = \frac{1}{2(1+v)}\int_{-v}^{2+v} u^2\,du = \frac{4 + 2v + v^2}{3}$

Thus $E[(R_1 + R_2)^2 \mid R_1 - R_2 = v] = \frac13[4 - 2|v| + v^2]$, $-1 < v < 1$.

11. $x^2 + 7/6$


14. $f_R(\lambda \mid R_1 = x_1, \ldots, R_n = x_n) = \lambda^x(1-\lambda)^{n-x}/\beta(x+1, n-x+1)$, $0 \le \lambda \le 1$,
where $x = x_1 + \cdots + x_n$

$E(R \mid R_1 = x_1, \ldots, R_n = x_n) = (x + 1)/(n + 2)$

16. $(np - npq^{n-1})/(1 - q^n - npq^{n-1})$

17. 1 + a8 — BP) G's — toV2)
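18. $f_{12}(x, y \mid z) = \frac{1}{\pi z^2}$, $x^2 + y^2 \le z^2$, and 0 elsewhere

(a) $E(D \mid R = z) = \iint (x^2 + y^2)^{1/2} f_{12}(x, y \mid z)\,dx\,dy$

$= \frac{1}{\pi z^2}\iint_{x^2+y^2 \le z^2} (x^2 + y^2)^{1/2}\,dx\,dy$

$= \frac{1}{\pi z^2}\int_0^{2\pi} d\theta\int_0^z r^2\,dr = \frac23 z$. Thus $E(D) = \int_0^\infty \frac23 ze^{-z}\,dz = \frac23$

(b) $h(z \mid x, y) = f(x, y, z)/f_{12}(x, y) = f_R(z)f_{12}(x, y \mid z)\Big/\int_{-\infty}^\infty f(x, y, z)\,dz$

$= \frac{e^{-z}/z^2}{\int_{(x^2+y^2)^{1/2}}^\infty (e^{-t}/t^2)\,dt}$, $z \ge (x^2 + y^2)^{1/2}$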
19. (b) 
d(x) = —-1, S32 eS a] 
= Q, —l<x<l 
=1, l<2z <3 


E_;((6* — 6)2] = 1/2 


20. 4@@ + 1) 


CHAPTER 5 


Section 5.2 
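1. $N_0(s) = N_1(s)N_2(s)N_3(s) = \left[\frac{e^s - e^{-s}}{2s}\right]^3 = \frac{1}{8s^3}(e^{3s} - 3e^{s} + 3e^{-s} - e^{-3s})$, all $s$

$f_0(x) = \frac{1}{16}[(x+3)^2u(x+3) - 3(x+1)^2u(x+1) + 3(x-1)^2u(x-1) - (x-3)^2u(x-3)]$

Thus

$f_0(x) = \frac{1}{16}(x+3)^2$, $-3 \le x \le -1$
$= \frac{3 - x^2}{8}$, $-1 \le x \le 1$
$= \frac{1}{16}(x-3)^2$, $1 \le x \le 3$
$= 0$ elsewhere.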
2. $f_0(x) = \frac19[(x+2)u(x+2) + 2(x+1)u(x+1) - 3xu(x) - 4(x-1)u(x-1) + 4(x-2)u(x-2)]$

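3. $N_{R_i^2}(s) = E(e^{-sR_i^2}) = \int_{-\infty}^\infty e^{-sx^2}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx$ = [with $y = (s + \frac12)^{1/2}x$]

$\int_{-\infty}^\infty [2\pi(s + \tfrac12)]^{-1/2}e^{-y^2}\,dy = [2(s + \tfrac12)]^{-1/2}$, Re $s > -\frac12$

$N_0(s) = \prod_{i=1}^n N_{R_i^2}(s) = [2(s + \tfrac12)]^{-n/2}$, Re $s > -\frac12$

and the result follows from Table 5.1.1.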


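4. $N_R(s) = \int_0^\infty \frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha-1}e^{-(s+\beta^{-1})x}\,dx$

$= \frac{1}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty \frac{y^{\alpha-1}e^{-y}}{(s+\beta^{-1})^\alpha}\,dy = \frac{1}{(1+\beta s)^\alpha}$, Re $s > -\frac{1}{\beta}$

If $R_0 = R_1 + R_2$ where $R_1$ and $R_2$ are independent and $R_i$ has the gamma
distribution with parameters $\alpha_i$ and $\beta$, $i = 1, 2$, then

$N_0(s) = N_1(s)N_2(s) = \left(\frac{1}{1+\beta s}\right)^{\alpha_1+\alpha_2}$, Re $s > -1/\beta$

so $R_0$ has the gamma distribution with parameters $\alpha_1 + \alpha_2$ and $\beta$.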
5. $f_0(x) = \lambda^nx^{n-1}e^{-\lambda x}u(x)/(n-1)!$

9. Integrate $\frac{e^{-iuz}}{\pi(1+z^2)}$ around contour (a) if $u > 0$, and around contour (b) if
$u < 0$. (Notice that $|e^{-iu(x+iy)}|$ is bounded if $u > 0$, as long as $y \le 0$.)

[Figure: contours (a) and (b); PROBLEM 5.2.9]

If $u > 0$, the integral is $-2\pi i$ times the residue of $\frac{e^{-iuz}}{\pi(1+z^2)}$ at $z = -i$, that is,

$-2\pi i\,\frac{e^{-iu(-i)}}{\pi(-2i)} = e^{-u}$

If $u < 0$, the integral is $2\pi i$ times the residue of $\frac{e^{-iuz}}{\pi(1+z^2)}$ at $z = i$, that is,

$2\pi i\,\frac{e^{-iu(i)}}{\pi(2i)} = e^{u}$

The result follows since the integral around the semicircle of radius $r$ approaches
0 as $r \to \infty$.


Section 5.3 


4. $M_R(u) = \sum_{n=-\infty}^\infty p_ne^{-iu(a+nd)}$ where $p_n = P\{R = a + nd\}$. Thus $M_R(u) =
e^{-iua}M(u)$ where $M$ is periodic with period $2\pi/d$; the numbers $p_n$ are the
coefficients of the Fourier series of $M$.


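5. $N_R(s) = \exp[\lambda(e^{-s} - 1)]$, hence $\frac{dN_R}{ds} = -\lambda e^{-s}\exp[\lambda(e^{-s} - 1)] = -\lambda$ when
$s = 0$. $\frac{d^2N_R}{ds^2} = (\lambda e^{-s})^2\exp[\lambda(e^{-s} - 1)] + \lambda e^{-s}\exp[\lambda(e^{-s} - 1)] = \lambda^2 + \lambda$
when $s = 0$. By (5.3.3), $E(R) = \lambda$, $E(R^2) = \lambda^2 + \lambda$, hence $\mathrm{Var}\,R = \lambda$.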
Section 5.4 


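1. (a) $P\{R_n \le x\} = P\{R_n \le x, R > x + \varepsilon\} + P\{R_n \le x, R \le x + \varepsilon\}$

$\le P\{|R_n - R| \ge \varepsilon\} + P\{R \le x + \varepsilon\}$

since $R_n \le x$, $R > x + \varepsilon$ implies $|R_n - R| \ge \varepsilon$ and

$R_n \le x$, $R \le x + \varepsilon$ implies $R \le x + \varepsilon$.
Similarly $P\{R \le x - \varepsilon\} = P\{R \le x - \varepsilon, R_n > x\} + P\{R \le x - \varepsilon, R_n \le x\}$
$\le P\{|R_n - R| \ge \varepsilon\} + P\{R_n \le x\}$
Thus $F(x - \varepsilon) - P\{|R_n - R| \ge \varepsilon\} \le F_n(x) \le P\{|R_n - R| \ge \varepsilon\} + F(x + \varepsilon)$
(b) Given $\delta > 0$, choose $\varepsilon > 0$ so small that

$F(x + \varepsilon) < F(x) + \frac{\delta}{2}$, $F(x - \varepsilon) > F(x) - \frac{\delta}{2}$

(This is possible since $F$ is continuous at $x$.) For large enough $n$,
$P\{|R_n - R| \ge \varepsilon\} < \delta/2$ since $R_n \to R$ in probability. By (a), $F(x) - \delta < F_n(x) < F(x) + \delta$
for large enough $n$. Thus $F_n(x) \to F(x)$.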
5. (a) n > 1,690,000 
(b) n > 9604 


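7. $P\{|R - \tfrac12 n| \ge .005n\} = P\left\{\frac{|R - \frac12 n|}{\frac12\sqrt{n}} \ge \frac{.005n}{\frac12\sqrt{n}}\right\} \approx P\{|R^*| \ge .01\sqrt{n}\}$

$= 2P\{R^* \ge .01\sqrt{n}\} = \frac{2}{\sqrt{2\pi}}\int_{.01\sqrt{n}}^\infty e^{-t^2/2}\,dt \le \frac{2}{\sqrt{2\pi}\,(.01\sqrt{n})}e^{-.0001n/2}$

$= \frac{200}{\sqrt{2\pi n}}e^{-(1/2)10^{-4}n}$. For example, if $n = 10^6$, this is $\frac{.2}{\sqrt{2\pi}}e^{-50}$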
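The bound used in the last step, $\int_x^\infty e^{-t^2/2}\,dt \le x^{-1}e^{-x^2/2}$, can be compared with the exact normal tail. The sketch below is an illustration only; it uses the standard library `math.erfc`, and the function names are hypothetical.

```python
import math


def two_sided_normal_tail(x):
    """P{|R*| >= x} for a standard normal R*, via the complementary error function."""
    return math.erfc(x / math.sqrt(2.0))


def tail_bound(n):
    """The bound (200 / sqrt(2*pi*n)) * exp(-(1/2) * 1e-4 * n) from the solution."""
    return 200.0 / math.sqrt(2.0 * math.pi * n) * math.exp(-0.5e-4 * n)


for n in (10**4, 10**5, 10**6):
    print(n, two_sided_normal_tail(0.01 * math.sqrt(n)), tail_bound(n))
```

For large $n$ the two columns agree closely, which shows that the bound is tight in the range where it matters.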
CHAPTER 6


Section 6.2 


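1. $f_1(x) = \lambda_1^x$ and $f_2(x) = \lambda_2^x$ [or $f_1(x) = \lambda^x$, $f_2(x) = x\lambda^x$ in the repeated root case]
are linearly independent solutions. For if $c_1f_1 + c_2f_2 \equiv 0$ then

$c_1\lambda_1^x + c_2\lambda_2^x = 0$
$c_1\lambda_1^{x+1} + c_2\lambda_2^{x+1} = 0$

If $c_1$ and $c_2$ are not both 0, then

$\begin{vmatrix} \lambda_1^x & \lambda_2^x \\ \lambda_1^{x+1} & \lambda_2^{x+1} \end{vmatrix} = 0$

hence

$\begin{vmatrix} 1 & 1 \\ \lambda_1 & \lambda_2 \end{vmatrix} = 0$, a contradiction

If $f$ is any solution then for some constants $A$ and $C$,

$\begin{bmatrix} f(0) \\ f(1) \end{bmatrix} = A\begin{bmatrix} f_1(0) \\ f_1(1) \end{bmatrix} + C\begin{bmatrix} f_2(0) \\ f_2(1) \end{bmatrix}$

since three vectors in a two-dimensional space are linearly dependent. But then

$f(2) = d_1f(1) + d_2f(0)$ (where $d_1 = 1/p$, $d_2 = -q/p$)
$= d_1[Af_1(1) + Cf_2(1)] + d_2[Af_1(0) + Cf_2(0)]$
$= Af_1(2) + Cf_2(2)$

Recursively, $f(x) = Af_1(x) + Cf_2(x)$. Thus all solutions are of the form $Af_1(x) +
Cf_2(x)$. But we have shown in the text that $A$ and $C$ are uniquely determined by
the boundary conditions at 0 and $b$; the result follows.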


. By the theorem of total expectation, if $R$ is the duration of the game and $A$ =
{win on trial 1} then

$E(R) = P(A)E(R \mid A) + P(A^c)E(R \mid A^c)$

Thus

$D(x) = p[1 + D(x+1)] + q[1 + D(x-1)]$, $x = 1, 2, \ldots, b-1$

since if we win on trial 1, the game has already lasted for one trial, and the
average number of trials remaining after the first is $D(x+1)$. [Notice that this
argument, just as the one leading to (6.2.1), is intuitive rather than formal.]

. In standard form, $pD(x+2) - D(x+1) + qD(x) = -p - q = -1$, $D(0) =
D(b) = 0$.
Case 1. $p \ne q$.

The homogeneous equation is the same as (6.2.1), with solution $A + C(q/p)^x$.
To find a particular solution, notice that the "forcing function" $-1$ already
satisfies the homogeneous equation, so try $D(x) = kx$. Then

$k[p(x+2) - (x+1) + qx] = (2p - 1)k = (p - q)k = -1$

Thus

$D(x) = A + C(q/p)^x + \frac{x}{q - p}$

Set $D(0) = D(b) = 0$ to solve for $A$ and $C$.

Case 2. $p = q = 1/2$.

The homogeneous solution is $A + Cx$. Since polynomials of degree 0 and 1
already satisfy the homogeneous equation, try as a particular solution
$D(x) = kx^2$. Then

$k[\tfrac12(x+2)^2 - (x+1)^2 + \tfrac12x^2] = k = -1$

Thus

$D(x) = A + Cx - x^2$

$D(0) = A = 0$, $D(b) = b(C - b) = 0$ so that $C = b$.
Therefore $D(x) = x(b - x)$.

If we let $b \to \infty$ we obtain

$D(x) = \infty$ if $p \ge q$; $D(x) = \frac{x}{q - p}$ if $p < q$
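The closed forms for the expected duration can be checked against the difference equation $D(x) = 1 + pD(x+1) + qD(x-1)$ with $D(0) = D(b) = 0$. In the sketch below the constants $A$ and $C$ of Case 1 are obtained by solving the two boundary equations, which gives $C = b/[(q-p)(1 - (q/p)^b)]$ and $A = -C$; this explicit solution and the function name are worked out here for illustration and are not quoted from the text.

```python
def expected_duration(x, b, p):
    """Expected duration D(x) of the game with absorbing barriers at 0 and b."""
    q = 1.0 - p
    if p == q:
        return float(x * (b - x))
    r = q / p
    # Boundary conditions: A + C = 0 and A + C * r**b + b / (q - p) = 0.
    c = b / ((q - p) * (1.0 - r**b))
    a = -c
    return a + c * r**x + x / (q - p)


# Residual of the difference equation at an interior point (should be about 0).
p, b, x = 0.6, 10, 4
residual = expected_duration(x, b, p) - (
    1.0 + p * expected_duration(x + 1, b, p) + (1.0 - p) * expected_duration(x - 1, b, p)
)
print(residual)
```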
Section 6.3 


1. $P\{S_1 \ge 0, \ldots, S_{2n-1} \ge 0, S_{2n} = 0\}$ is the number of paths from (0, 0) to $(2n, 0)$
lying on or above the axis, times $(pq)^n$. These paths are in one-to-one
correspondence with the paths from $(-1, -1)$ to $(2n, 0)$ lying above $-1$ [connect
$(-1, -1)$ to (0, 0) to establish the correspondence]. Thus the number of paths
is the same as the number from (0, 0) to $(2n + 1, 1)$ lying above 0, namely

$\binom{2n+1}{a}\frac{a - b}{2n + 1}$ where $a + b = 2n + 1$, $a - b = 1$, that is, $a = n + 1$, $b = n$

Thus the desired probability is

$\binom{2n+1}{n+1}\frac{1}{2n+1}(pq)^n = \frac{(2n)!}{n!\,(n+1)!}(pq)^n = \frac{u_{2n}}{n+1}$
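The path count behind this answer, $\binom{2n}{n}/(n+1)$, can be confirmed by direct enumeration for small $n$. The sketch below is illustrative only and uses a hypothetical helper name; it enumerates all $2^{2n}$ step sequences, so it is only practical for small `n`.

```python
from itertools import product
from math import comb


def count_nonnegative_returns(n):
    """Count +-1 paths of length 2n that end at 0 and never go below the axis."""
    count = 0
    for steps in product((1, -1), repeat=2 * n):
        height = 0
        stays_nonnegative = True
        for s in steps:
            height += s
            if height < 0:
                stays_nonnegative = False
                break
        if stays_nonnegative and height == 0:
            count += 1
    return count


for n in range(1, 6):
    print(n, count_nonnegative_returns(n), comb(2 * n, n) // (n + 1))
```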


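3. $u_{2n} = \binom{2n}{n}(pq)^n = \frac{(2n)!}{n!\,n!}(pq)^n \sim \frac{(2n)^{2n}e^{-2n}\sqrt{2\pi \cdot 2n}}{(n^ne^{-n}\sqrt{2\pi n})^2}(pq)^n = \frac{(4pq)^n}{\sqrt{n\pi}} = \frac{1}{\sqrt{n\pi}}$ if $p = q = 1/2$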
5. By Problem 2,

$h_{2n} = \frac{u_{2n-2}}{2n} \sim \frac{1}{2n}\,\frac{1}{\sqrt{\pi(n-1)}} \sim \frac{1}{2\sqrt{\pi}\,n^{3/2}}$

Let $T$ be the time required to return to 0. Then

$P\{T = 2n\} = h_{2n}$, $n = 1, 2, \ldots$ where $\sum_{n=1}^\infty h_{2n} = 1$

$E(T) = \sum_{n=1}^\infty 2nP\{T = 2n\} = \sum_{n=1}^\infty 2nh_{2n}$

But $2nh_{2n} \sim K/\sqrt{n}$ and $\sum 1/\sqrt{n} = \infty$, hence $E(T) = \infty$.
6. The probability that both players will have $k$ heads is $[\binom{n}{k}(\frac12)^n]^2$; sum from $k = 0$
to $n$ to obtain the desired result.

. The probability that both players will receive the same number of heads = the
probability that the number of heads obtained by player 1 = the number of
tails obtained by player 2 (since $p = q = 1/2$), and this is the probability of being
at 0 after $2n$ steps of a simple random walk, namely $\binom{2n}{n}(\frac12)^{2n}$. Comparing this
expression with the result of Problem 6, we obtain the desired conclusion.
(Alternatively, we may use the formula of Section 2.9, Problem 4, with $m =
k = n$.)
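The two expressions for the probability of equal head counts lead to the identity $\sum_{k=0}^n \binom{n}{k}^2 = \binom{2n}{n}$, which is easy to confirm with exact arithmetic. This is a small illustrative sketch with hypothetical function names.

```python
from fractions import Fraction
from math import comb


def same_heads_probability(n):
    """Sum over k of [C(n, k) * (1/2)**n]**2, computed exactly."""
    return sum(Fraction(comb(n, k), 2**n) ** 2 for k in range(n + 1))


def return_probability(n):
    """Probability that a symmetric simple random walk is at 0 after 2n steps."""
    return Fraction(comb(2 * n, n), 4**n)


print(same_heads_probability(5), return_probability(5))
```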


Section 6.4 


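1. $(1 - 4pqz^2)^{1/2} = \sum_{n=0}^\infty \binom{1/2}{n}(-4pqz^2)^n = \sum_{n=0}^\infty \binom{1/2}{n}(-4pq)^nz^{2n}$

Thus

$H(z) = 1 - (1 - 4pqz^2)^{1/2} = -\sum_{n=1}^\infty \binom{1/2}{n}(-4pq)^nz^{2n}$

Thus

$h_{2n} = (-1)^{n+1}\binom{1/2}{n}(4pq)^n$, $n = 1, 2, \ldots$

But

$(-1)^{n+1}\binom{1/2}{n} = (-1)^{n+1}\frac{(\frac12)(-\frac12)(-\frac32)\cdots(-\frac{2n-3}{2})}{n!}$

$= \frac{1 \cdot 3 \cdot 5 \cdots (2n-3)}{2^nn!} = \frac{(2n-2)!}{2^nn!\,[2 \cdot 4 \cdots (2n-2)]}$

$= \frac{(2n-2)!}{2^nn(n-1)!\,2^{n-1}(n-1)!} = \frac{2}{n}\binom{2n-2}{n-1}\left(\frac12\right)^{2n}$

Therefore $h_{2n} = (2/n)\binom{2n-2}{n-1}(pq)^n$, in agreement with (6.3.5).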
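The agreement between the series coefficient and the closed form for $h_{2n}$ can be verified exactly with rational arithmetic. The sketch below is illustrative; `binom_half` is a hypothetical helper that evaluates the generalized binomial coefficient $\binom{1/2}{n}$ as a product.

```python
from fractions import Fraction
from math import comb


def binom_half(n):
    """Generalized binomial coefficient C(1/2, n) as an exact fraction."""
    result = Fraction(1)
    for k in range(n):
        result *= (Fraction(1, 2) - k) / (k + 1)
    return result


def h_from_series(n, p):
    """Coefficient of z**(2n) in 1 - (1 - 4pq z**2)**(1/2)."""
    q = 1 - p
    return (-1) ** (n + 1) * binom_half(n) * (4 * p * q) ** n


def h_closed_form(n, p):
    """(2/n) * C(2n-2, n-1) * (pq)**n."""
    q = 1 - p
    return Fraction(2, n) * comb(2 * n - 2, n - 1) * (p * q) ** n


p = Fraction(1, 3)
print([h_from_series(n, p) == h_closed_form(n, p) for n in range(1, 6)])
```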
2. $\sum_{n=0}^\infty a_{n+1}z^n - 3\sum_{n=0}^\infty a_nz^n = 4\sum_{n=0}^\infty z^n = \frac{4}{1-z}$

If

$A(z) = \sum_{n=0}^\infty a_nz^n$, then $\frac{1}{z}[A(z) - a_0] - 3A(z) = \frac{4}{1-z}$

or

$A(z)(z^{-1} - 3) = \frac{4}{1-z} + \frac{a_0}{z}$

Thus

$A(z) = \frac{4z}{(1-z)(1-3z)} + \frac{a_0}{1-3z}$

$= \frac{-2}{1-z} + \frac{2 + a_0}{1-3z} = -2\sum_{n=0}^\infty z^n + (2 + a_0)\sum_{n=0}^\infty 3^nz^n$

Thus $a_n = (2 + a_0)3^n - 2$. Notice that $(2 + a_0)3^n$ is the homogeneous solution,
$-2$ the particular solution.


6. (a) $P\{N_r = k\}$ = P{the first $k - 1$ trials result in exactly $r - 1$ successes, and
trial $k$ results in a success} = $\binom{k-1}{r-1}p^{r-1}q^{k-r}p$, $k = r, r+1, \ldots$. Now

$\binom{k-1}{r-1} = \binom{k-1}{k-r} = (k-1)(k-2)\cdots(r+1)r/(k-r)!$
$= (-1)^{k-r}(-r)(-r-1)(-r-2)\cdots[-r - (k-r-1)]/(k-r)!$
$= (-1)^{k-r}\binom{-r}{k-r}$, and the result follows.

Note that if $j = k - r$, this computation shows that $\binom{-r}{j} = (-1)^j\binom{r+j-1}{j}$, $j = 0, 1, \ldots$

(b) We show that $T_1$ and $T_2$ are independent. The argument for $T_1, \ldots, T_r$ is
similar, but the notation becomes cumbersome.
$P\{T_1 = j, T_2 = k\} = P\{R_1 = \cdots = R_{j-1} = 0, R_j = 1, R_{j+1} = \cdots = R_{j+k-1} = 0, R_{j+k} = 1\}$

$= p^2q^{j+k-2}$, $j, k = 1, 2, \ldots$
Now $P\{T_1 = j\} = q^{j-1}p$ by Problem 5, and
$P\{T_2 = k\} = \sum_{j=1}^\infty P\{T_1 = j, T_2 = k\} = p^2q^{k-1}\sum_{j=1}^\infty q^{j-1} = pq^{k-1}$

Hence $P\{T_1 = j, T_2 = k\} = P\{T_1 = j\}P\{T_2 = k\}$ and the result follows.
(c) $E(N_r) = r/p$, $\mathrm{Var}\,N_r = r[(1 - p)/p^2]$ since $N_r = T_1 + \cdots + T_r$ and the $T_i$
are independent. The generalized characteristic function of $N_r$ is

$\left(\frac{pe^{-s}}{1 - qe^{-s}}\right)^r$, $|qe^{-s}| < 1$

Set $s = iu$ to obtain the characteristic function, $z = e^{-s}$ to obtain the
generating function.

7. $P\{R = k\} = p^kq + q^kp$, $k = 1, 2, \ldots$; $E(R) = pq^{-1} + qp^{-1}$
8. $1/\sqrt{2}$
Section 6.5 


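2. P{an even number of customers arrives in $(t, t + \tau]$} = $\sum_{k=0,2,4,\ldots} e^{-\lambda\tau}(\lambda\tau)^k/k!$

$= \frac12\sum_{k=0}^\infty e^{-\lambda\tau}(\lambda\tau)^k/k! + \frac12\sum_{k=0}^\infty e^{-\lambda\tau}(-\lambda\tau)^k/k!$

$= \frac12e^{-\lambda\tau}(e^{\lambda\tau} + e^{-\lambda\tau}) = \frac12(1 + e^{-2\lambda\tau})$

P{an odd number of customers arrives in $(t, t + \tau]$} = $\sum_{k=1,3,5,\ldots} e^{-\lambda\tau}(\lambda\tau)^k/k!$

$= \frac12\sum_{k=0}^\infty e^{-\lambda\tau}(\lambda\tau)^k/k! - \frac12\sum_{k=0}^\infty e^{-\lambda\tau}(-\lambda\tau)^k/k!$

$= \frac12e^{-\lambda\tau}(e^{\lambda\tau} - e^{-\lambda\tau}) = \frac12(1 - e^{-2\lambda\tau})$

Alternatively, we may note that $\sum_{k=0,2,4,\ldots} (\lambda\tau)^k/k! = \cosh\lambda\tau$ and
$\sum_{k=1,3,5,\ldots} (\lambda\tau)^k/k! = \sinh\lambda\tau$.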
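The parity probabilities for a Poisson count can be checked by summing the series term by term. The following sketch is illustrative only: `x` stands for the product $\lambda\tau$, and the truncation at 200 terms is an arbitrary choice that is more than enough for moderate `x`.

```python
import math


def parity_probabilities(x, terms=200):
    """Return (P{even}, P{odd}) for a Poisson count with mean x, by direct summation."""
    term = math.exp(-x)  # the k = 0 term, exp(-x) * x**0 / 0!
    even = odd = 0.0
    for k in range(terms):
        if k % 2 == 0:
            even += term
        else:
            odd += term
        term *= x / (k + 1)  # advance to exp(-x) * x**(k+1) / (k+1)!
    return even, odd


even, odd = parity_probabilities(2.0)
print(even, 0.5 * (1.0 + math.exp(-4.0)))
print(odd, 0.5 * (1.0 - math.exp(-4.0)))
```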


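3. (a) $P\{R_t = 1, R_{t+\tau} = 1\} = P\{R_t = -1, R_{t+\tau} = -1\} = \frac14(1 + e^{-2\lambda\tau})$
$P\{R_t = 1, R_{t+\tau} = -1\} = P\{R_t = -1, R_{t+\tau} = 1\} = \frac14(1 - e^{-2\lambda\tau})$
(b) $K(t, \tau) = e^{-2\lambda\tau}$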


Section 6.6 


4. (a) is immediate from Theorem 1.

(b) If $\omega \in A_n$ for infinitely many $n$, then $\omega \in A$, which in turn implies that
$\omega \in A_n$ eventually. Thus $\limsup A_n \subset A \subset \liminf A_n \subset \limsup A_n$, so all
these sets are equal.

(c) Let $A_n = [1 - 1/n, 2 - 1/n]$; $\limsup A_n = \liminf A_n = [1, 2)$. (Another
example. If the $A_n$ are disjoint, $\limsup A_n = \liminf A_n = \emptyset$.)

(d) $\bigcap_{k=n}^\infty A_k \subset A_n \subset \bigcup_{k=n}^\infty A_k$, and $B_n = \bigcap_{k=n}^\infty A_k$ expands to $\liminf_n
A_n = A$, $C_n = \bigcup_{k=n}^\infty A_k$ contracts to $\limsup_n A_n = A$. Thus $P(A_n)$ is
boxed between $P(B_n)$ and $P(C_n)$, each of which approaches $P(A)$.

(e) $(\liminf_n A_n)^c = (\bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k)^c = \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k^c = \limsup_n A_n^c$
by the De Morgan laws; $(\limsup_n A_n)^c = \liminf_n A_n^c$ similarly.

6. $\liminf A_n = \{(x, y): x^2 + y^2 < 1\}$,
$\limsup A_n = \{(x, y): x^2 + y^2 \le 1\} - \{(0, 1), (0, -1)\}$.
8. $P(\limsup_n A_n) = \lim_{n\to\infty} P(\bigcup_{k=n}^\infty A_k)$ by definition of lim sup, hence

$P(\limsup_n A_n) = \lim_{n\to\infty}\lim_{m\to\infty} P(\bigcup_{k=n}^m A_k)$

Now

$P\left[\left(\bigcup_{k=n}^m A_k\right)^c\right] = P\left(\bigcap_{k=n}^m A_k^c\right) = \prod_{k=n}^m P(A_k^c)$ by independence

$\le \prod_{k=n}^m e^{-P(A_k)}$ since $P(A_k^c) = 1 - P(A_k) \le e^{-P(A_k)}$

$= e^{-\sum_{k=n}^m P(A_k)} \to 0$ as $m \to \infty$ since $\sum P(A_n) = \infty$

The result follows.

9. If $\sum_{n=1}^\infty P\{|R_n - c| \ge \varepsilon\} < \infty$ for every $\varepsilon > 0$, $R_n \to c$ almost surely by Theorem 5. Thus
assume that $\sum_{n=1}^\infty P\{|R_n - c| \ge \varepsilon\} = \infty$ for some $\varepsilon > 0$. Then by the second
Borel-Cantelli lemma, $P\{|R_n - c| \ge \varepsilon$ for infinitely many $n\} = 1$. But
$|R_n - c| \ge \varepsilon$ for infinitely many $n$ implies that $R_n \not\to c$, hence $P\{R_n \not\to c\} = 1$,
that is, $P\{R_n \to c\} = 0$.


12. Let S, = (R, + °°: + R,)/n; then E(S,/n)* = (1/n?) Var S, 


1 
= — Var R 
a2, eS iaik 
But 
1 1 n gopke tele, 
ei eyaes +-<| -— de =Inn (notice—- <-ifk-—l<x<k 
) a: n~ jy @ k~ & 
Hence 
S,~% M 
E{—]} <- (1 + Inn), so that 
n n 
00 Sy \ Sin Ss. 
> (> < oO. By Theorem 6, — ——> 0 
n=1 n is 
CHAPTER 7 
Section 7.1 


P{R,, = j, Ry =r} 
> Pr Prix a "Pin gin gP igo 


“1 > ee ey tn—1 


Pr ee 


> P{T, =Ig,---5 Tn_y = In_y, T, n =} 


zo yee e9 tn—1 


> Vr Pri, ay * Pin_otn—1 Pind =p 


tj: ee eytn—1 


1. P{R, =j| Ry =r} = 


P{T, = j} 
4. $P\{R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid R_n = i\}$

$= P\{R_n = i, R_{n+1} = i_1, \ldots, R_{n+k} = i_k\}/P\{R_n = i\}$

$= \frac{\sum_{j_0, \ldots, j_{n-1}} p_{j_0}p_{j_0j_1}\cdots p_{j_{n-2}j_{n-1}}p_{j_{n-1}i}\,p_{ii_1}\cdots p_{i_{k-1}i_k}}{\sum_{j_0, \ldots, j_{n-1}} p_{j_0}p_{j_0j_1}\cdots p_{j_{n-2}j_{n-1}}p_{j_{n-1}i}}$

$= p_{ii_1}\cdots p_{i_{k-1}i_k}$


Section 7.2 


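1. The desired probability is

$\frac{P[D \cap \{R_n = i, R_{n+1} = i_1, \ldots, R_{n+k} = i_k\}]}{P[D \cap \{R_n = i\}]}$

Now $D$ must be of the form $(R_0, \ldots, R_n) \in B$ for some $B \subset S^{n+1}$, hence $D$ is a
countable union of sets of the form $\{R_0 = j_0, \ldots, R_n = j_n\}$. Now

$P\{R_0 = j_0, \ldots, R_n = j_n, R_n = i, R_{n+1} = i_1, \ldots, R_{n+k} = i_k\} = 0$ if $j_n \ne i$

$= P\{R_0 = j_0, \ldots, R_n = j_n, R_n = i\}P\{R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid R_n = i\}$ if $j_n = i$

Thus the numerator is simply

$P[D \cap \{R_n = i\}]\,P\{R_{n+1} = i_1, \ldots, R_{n+k} = i_k \mid R_n = i\}$

and the result follows.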


Section 7.3 


1. (a) If $n_1 \in A$, then $n_1$ is a multiple of $d$, say $n_1 = r_1d$. If $r_1 = 1$ then $\{n_1\}$ is the
required set. If not choose $n_2 \in A$ such that $n_1$ does not divide $n_2$ (if $n_2$ does
not exist then $d = n_1$, hence $r_1 = 1$). If $n_2 = r_2'd$ then $\gcd(n_1, n_2) =
\gcd(r_1d, r_2'd) = r_2d$ for some positive integer $r_2 < r_1$ (if $r_2 = r_1$, then $n_1$
divides $n_2$). If $r_2 = 1$, $\{n_1, n_2\}$ is the required set; if not, find $n_3 \in A$ such
that $r_2d$ does not divide $n_3$. [If $n_3$ does not exist, then $r_2d$ divides everything
in $A$ including $n_1$ and $n_2$. But $r_2d = \gcd(n_1, n_2)$ so that $r_2d = d$, hence
$r_2 = 1$, a contradiction.] If $n_3 = r_3'd$ then $\gcd(n_1, n_2, n_3) = \gcd(r_2d, r_3'd) =
r_3d$, $r_3 < r_2$. If $r_3 = 1$, $\{n_1, n_2, n_3\}$ is the desired set. If not, find $n_4 \in A$ such
that $r_3d$ does not divide $n_4$, and continue in this fashion. We obtain a
decreasing sequence of positive integers $r_1 > r_2 > \cdots$; the sequence must
terminate in a finite number of steps, thus yielding a finite set whose gcd is $d$.

(b) By (a) there are integers $n_1, \ldots, n_k \in A$, such that $\gcd(n_1, \ldots, n_k) = d$. Thus
for some integers $c_1, \ldots, c_k$ we have $c_1n_1 + \cdots + c_kn_k = d$. Collect positive
and negative terms to obtain, since $A$ is closed under addition, $md, nd \in A$
such that $md - nd = d$, that is, $m - n = 1$. Now let $q = cd$, $c \ge n(n-1)$.
Write $c = an + b$, $a \ge n - 1$, $0 \le b \le n - 1$. Then $q = cd = [(a - b)n +
(bn + b)]d = [(a - b)n + bm]d \in A$, since $a - b \ge 0$ and $A$ is closed under
addition. The result follows.

(c) Let $A = \{n \ge 1: p_{ii}^{(n)} > 0\}$, $B = \{n \ge 1: f_{ii}^{(n)} > 0\}$. Then $d_i = \gcd(A)$. If
$e_i = \gcd(B)$, then since $B \subset A$ we have $e_i \ge d_i$. To show that $d_i \ge e_i$ we
show that $e_i$ divides all elements of $A$. If this is not the case, let $n$ be the
smallest positive integer such that $p_{ii}^{(n)} > 0$ and $e_i$ does not divide $n$. Write
$n = ae_i + b$, $0 < b < e_i$. Then

$p_{ii}^{(n)} = \sum_{k=1}^n f_{ii}^{(k)}p_{ii}^{(n-k)} = \sum_{r=1}^a f_{ii}^{(re_i)}p_{ii}^{((a-r)e_i+b)}$

Now $e_i$ does not divide $(a - r)e_i + b$, so by minimality of $n$,
$p_{ii}^{((a-r)e_i+b)} = 0$ for all $r = 1, 2, \ldots, a$.
But then $p_{ii}^{(n)} = 0$, a contradiction. The result follows.

4. (a) $S$ forms a single aperiodic equivalence class. Starting from 1, the probability
of returning to 1 is 1 if $p \le q$, and $q + p(q/p) = 2q$ if $p > q$ (see 6.2.6).
Thus $S$ is recurrent if $p \le q$, transient if $p > q$.

(b) and (c) $S$ forms a single aperiodic equivalence class. By Theorem 6, $S$ is
recurrent.

(d) The equivalence classes are $C = \{1\}$ and $D = \{2, 3\}$. $C$ is not closed, hence
is transient; $D$ is closed, hence recurrent. $C$ and $D$ are aperiodic.


Section 7.4 


1. For each $n$ so that $f_n > 0$, form the following system of states (always originating
from a fixed state $i$). It is clear from the construction that $f_{ii}^{(n)} = f_n$ for all $n$.

[Figure: construction of the chain; PROBLEM 7.4.1]

Also, $p_{ii}^{(n)} = u_n$, by induction, using the First Entrance Theorem.
Note. We need not have $\gcd\{j: f_j > 0\} = 1$ for this construction.
2. Construct a Markov chain such that for some $i$,
$f_{ii}^{(n)} = P\{T_1 = n\}$, $n = 1, 2, \ldots$

We claim that $G(n) = p_{ii}^{(n)}$ for all $n = 1, 2, \ldots$. For $G(1) = P\{T_1 = 1\} =
f_{ii}^{(1)} = p_{ii}^{(1)}$, and if $G(r) = p_{ii}^{(r)}$ for $r = 1, 2, \ldots, n$, then

$G(n + 1) = \sum_{k=1}^\infty P\{T_1 + \cdots + T_k = n + 1\}$

$= \sum_{k=1}^\infty \sum_{l=1}^{n+1} P\{T_1 = l\}P\{T_2 + \cdots + T_k = n + 1 - l\}$

$= \sum_{l=1}^{n+1} f_{ii}^{(l)}G(n + 1 - l)$, if we define $G(0) = 1$

$= \sum_{l=1}^{n+1} f_{ii}^{(l)}p_{ii}^{(n+1-l)}$ by induction hypothesis

$= p_{ii}^{(n+1)}$ by the First Entrance Theorem

Now state $i$ is recurrent since $\sum_{n=1}^\infty P\{T_1 = n\} = 1$, and has period $d$ by
hypothesis. Thus by Theorem 2(c), $\lim_{n\to\infty} G(nd) = d/\mu$.

Since a renewal can only take place at times $nd$, $n = 1, 2, \ldots$, $G(nd)$ is the
probability that a renewal takes place in the interval $[nd, (n + 1)d)$. If the average
length of time between renewals is $\mu$, for large $n$ it is reasonable that one
renewal should take place every $\mu$ seconds, hence there should be, on the average,
$d/\mu$ renewals in a time interval of length $d$. Thus we expect intuitively that
$G(nd) \to d/\mu$.


4. (a) Let the initial state be i. Then V,; = > =, Ip 55 and the result follows. 


(b) By (a), $N = \sum_{n=0}^\infty Q^n$ so that $QN = \sum_{n=1}^\infty Q^n = N - I$. (In particular, $QN$
is finite.)

(c) By (b), $(I - Q)N = I$. But $N = \sum_{n=0}^\infty Q^n$ so that $QN = NQ$, hence
$N(I - Q) = I$.


Section 7.5 


2. (a) $j_0$ is recurrent since $p_{j_0j_0}^{(n)} \ge \delta$ for all $n \ge N$ (see Problem 2, Section 7.1),
hence

$\sum_{n=1}^\infty p_{j_0j_0}^{(n)} = \infty$

If $i$ is any recurrent state then since $p_{ij_0}^{(N)} \ge \delta > 0$, $i$ leads to $j_0$. By Theorem
5 of Section 7.3, $j_0$ leads to $i$, so that there can only be one recurrent class.
Since $p_{j_0j_0}^{(n)} \ge \delta > 0$ for all $n \ge N$, the class is aperiodic, so that $\lim_{n\to\infty} p_{j_0j_0}^{(n)} =
1/\mu_{j_0}$. But then $1/\mu_{j_0} \ge \delta > 0$, hence $\mu_{j_0} < \infty$ and the class is positive.
(Note also that if $i$ is any state and $C$ is the equivalence class of $j_0$, then,
for $n \ge N$, $P\{R_n \notin C \mid R_0 = i\} \le 1 - \delta$, hence $P\{R_{kn} \notin C \mid R_0 = i\} \le
(1 - \delta)^k \to 0$ as $k \to \infty$. Thus $f_{ij_0} = 1$, and it follows that a steady state
distribution exists.)

(b) If $\Pi^N$ has a positive column then by (a) there is exactly one recurrent class,
which is (positive and) aperiodic, and therefore a steady state distribution
exists. Conversely, let $\{v_j\}$ be a steady state distribution. Pick $j_0$ so that
$v_{j_0}$ $[= \lim_{n\to\infty} p_{ij_0}^{(n)}] > 0$. Since the chain is finite, $p_{ij_0}^{(n)} > 0$ for all $i$ if $n$ is
sufficiently large, say $n \ge N$. But then $\Pi^N$ has a positive column.


3.1. If $p \ne q$, the chain is transient, so $p_{ij}^{(n)} \to 0$ for all $i, j$. If $p = q$ the chain is
recurrent. We have observed (Problem 5, Section 6.3) that the mean recurrence
time is infinite, hence the chain is recurrent null, and thus $p_{ij}^{(n)} \to 0$. In either
case there is no stationary distribution, hence no steady state distribution.
The period is 2.

2. There is one positive recurrent class, namely {0}; the remaining states form a
transient class. Thus there is a unique stationary distribution, given by $v_0 = 1$,
$v_j = 0$, $j \ge 1$. Now starting from $i \ge 1$, the probability of eventually reaching
0 is $\lim_{n\to\infty} p_{i0}^{(n)}$, since the events $\{R_n = 0\}$ expand to {0 is reached eventually}.
By (6.2.6),

$\lim_{n\to\infty} p_{i0}^{(n)} = (q/p)^i$ if $p > q$
$= 1$ if $p \le q$

(Also $p_{00}^{(n)} \equiv 1$, $p_{ij}^{(n)} \to 0$, $j \ge 1$.) If $p > q$ the limit is not independent of $i$ so
there is no steady state distribution.


3. There are two positive recurrent classes {0} and $\{b\}$. $\{1, 2, \ldots, b - 1\}$ is a
transient class. Thus, there are uncountably many stationary distributions,
given by $v_0 = p_1$, $v_b = p_2$, $v_i = 0$, $1 \le i \le b - 1$, where $p_1, p_2 \ge 0$, $p_1 + p_2 =
1$. There is no steady state distribution. By (6.2.3) and (6.2.4),

$\lim_{n\to\infty} p_{i0}^{(n)} = \frac{(q/p)^i - (q/p)^b}{1 - (q/p)^b}$ if $p \ne q$

$= 1 - \frac{i}{b}$ if $p = q$

$\lim_{n\to\infty} p_{ib}^{(n)} = 1 - \lim_{n\to\infty} p_{i0}^{(n)}$

$\lim_{n\to\infty} p_{ij}^{(n)} = 0$, $1 \le j \le b - 1$

4. The chain is aperiodic. If $p > q$ then $f_{i1} = (q/p)^{i-1} < 1$, $i > 1$, hence the chain
is transient. Therefore $p_{ij}^{(n)} \to 0$ as $n \to \infty$ for all $i, j$, and there is no stationary
or steady state distribution. Now if $p \le q$ then $f_{i1} = 1$ for $i > 1$, hence $f_{11} =
q + pf_{21} = 1$, and the chain is recurrent. The equations $V\Pi = V$ become

$v_1q + v_2q = v_1$
$v_1p + v_3q = v_2$
$v_2p + v_4q = v_3$
$\cdots$

This may be reduced to $v_j = (p/q)v_{j-1}$, $j = 2, 3, \ldots$. If $p = q$ then all $v_j$ are
equal, hence $v_j \equiv 0$ and there is no stationary or steady state distribution. Thus
the chain is recurrent null. If $p < q$, the condition $\sum_{j=1}^\infty v_j = 1$ yields the unique
solution

$v_j = \frac{q - p}{q}\left(\frac{p}{q}\right)^{j-1}$, $j = 1, 2, \ldots$

Thus there is a unique stationary distribution, so that the chain is recurrent
positive; $\{v_j\}$ is also the steady state distribution.
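The geometric stationary distribution can be checked directly against the equations $V\Pi = V$ written out above. The sketch below is illustrative only; it evaluates $v_j$ for a value $p < q$ and tests the first few balance equations and the normalization. The function name and the truncation point are arbitrary.

```python
def stationary_probability(j, p):
    """v_j = ((q - p) / q) * (p / q)**(j - 1), j = 1, 2, ..., assuming p < q."""
    q = 1.0 - p
    return (q - p) / q * (p / q) ** (j - 1)


p = 0.3
q = 1.0 - p
v = [stationary_probability(j, p) for j in range(1, 400)]  # v[0] is v_1

# First balance equation: v_1 q + v_2 q = v_1.
print(v[0] * q + v[1] * q - v[0])
# Interior balance equations: v_{j-1} p + v_{j+1} q = v_j for j >= 2.
print(max(abs(v[k - 1] * p + v[k + 1] * q - v[k]) for k in range(1, 50)))
# Normalization (the truncated tail is negligible for p = 0.3).
print(sum(v))
```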


5. The chain forms a recurrent positive aperiodic class, hence $p_{ij}^{(n)} \to v_j$ where the
$v_j$ form the unique stationary distribution and the steady state distribution.
The equations $V\Pi = V$, $\sum_j v_j = 1$ yield


_ (p/q)** 
U 
> (piq)i 
j=1 


j 


6. The chain forms a recurrent positive aperiodic class. Since $\Pi^2$ has identical
rows $(p^2, pq, qp, q^2) = V$, there is a steady state distribution (= the unique
stationary distribution), namely $V$.


7. The chain forms a recurrent positive aperiodic class, hence $p_{ij}^{(n)} \to v_j$ where the
$v_j$ form the unique stationary distribution and the steady state distribution.
The equations $V\Pi = V$, $\sum_j v_j = 1$ yield


iT 


| — 1 = 
Vv) = 34> Vg = 3> V3 = 24> “4 >= 


io) 


8. There is a single positive recurrent class {2, 3}, which is aperiodic, hence
$p_{ij}^{(n)} \to v_j$ where the $v_j$ form the unique stationary distribution and the steady
state distribution. We find that $v_1 = 0$, $v_2 = 3/7$, $v_3 = 4/7$.

9. We may take $p_{ij} = P\{R_n = j\}$ for all $i, j$ (with initial distribution $p_j = P\{R_n = j\}$
also). The chain forms a recurrent class since from any initial state, $P\{R_n$
never $= j\} = \prod_{n=1}^\infty P\{R_n \ne j\} = \prod_{n=1}^\infty (1 - p_j) = 0$. The class is aperiodic.
Clearly $v_j = p_j$ is a stationary distribution, so that the chain is recurrent
positive and the stationary distribution is unique and coincides with the steady
state distribution.
10. The chain forms a positive recurrent class of period 3 (see Section 7.3, Example
2). Thus there is a unique stationary distribution given by


colbo 


ee AL, — 2 pene ee 2h So ek mas Es ae 
y= 9, YVea=3 U9, 4 F=9. U4 F~T2. Vg =z W= 


Now the cyclically moving subclasses are C, = {1,2}, C, = {3,4}, C, = 
{5, 6, 7}. By Theorem 2(c) of Section 7.4, if ie C,,j € C,,, then p{3r+#) —> 3u,. 
Thus 


12345 67 
it 2000 0 0 
214-2 000 0 0 
330 0420 00 

m3 -.40 04 20 0 0 
5}0 000: +4 2 
6600008 + 2 
710000 7 2 
12345 67 
i170 042000 
2300420 0 0 
330000 + 3 

IB+1-.4/0 000 }f ze 2 
5}2 2000 0 0 
662 2000 00 
714 2000 00 
123445 67 
ro 000f & 
2300003 + 3 
332 2000 0 0 

Ipt+2 442 2000 0 0 
5}0 04 20 0 0 
6/0 0 4 20 0 0 
71/0 0420 00) 
CHAPTER 8


Section 8.2 


1. (a) $L(x) = 2e^{-2x}/e^{-x} = 2e^{-x}$, so $L(x) > \lambda$ iff $x < c = -\ln(\lambda/2)$. Thus

$\alpha = \int_0^c e^{-x}\,dx = 1 - e^{-c}$

$\beta = \int_c^\infty 2e^{-2x}\,dx = e^{-2c} = (1 - \alpha)^2$

Hence as in Example 1 of the text, $S_0 = \{(\alpha, (1 - \alpha)^2), 0 \le \alpha \le 1\}$ and
$S = \{(\alpha, \beta): 0 \le \alpha \le 1, (1 - \alpha)^2 \le \beta \le 1 - \alpha^2\}$.

(b) $e^{-c} = 1 - \alpha = .95$, so that $c = .051$. Thus we reject $H_0$ if $x < .051$, accept
$H_0$ if $x > .051$. We have $\beta = (1 - \alpha)^2 = .9025$, which indicates that tests
based on a single observation are not very promising here.

(c) Set $\alpha = \beta = (1 - \alpha)^2$; thus $\alpha = (3 - \sqrt{5})/2 = .38 = 1 - e^{-c}$, so that
$c = .477$.
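The numbers in parts (b) and (c) follow from the two relations $\alpha = 1 - e^{-c}$ and $\beta = e^{-2c}$. The sketch below simply evaluates them; the function names are hypothetical, and small differences from the rounded figures in the text are to be expected.

```python
import math


def critical_value(alpha):
    """Solve alpha = 1 - exp(-c) for c."""
    return -math.log(1.0 - alpha)


def type2_error(c):
    """beta = exp(-2c), the probability of accepting H_0 under the density 2*exp(-2x)."""
    return math.exp(-2.0 * c)


# Part (b): alpha = .05
c_b = critical_value(0.05)
print(c_b, type2_error(c_b))

# Part (c): alpha = beta, i.e. alpha = (3 - sqrt(5)) / 2
alpha_c = (3.0 - math.sqrt(5.0)) / 2.0
print(alpha_c, critical_value(alpha_c))
```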
3. (a)

x         1     2     3     4     5     6
p_0(x)   1/6   1/6   1/6   1/6   1/6   1/6
p_1(x)   1/4   1/4   1/8   1/8   1/8   1/8
L(x)     3/2   3/2   3/4   3/4   3/4   3/4

LRT                 Rejection Region   Acceptance Region    α     β
0 ≤ λ < 3/4         all x              empty                1     0
3/4 < λ < 3/2       x = 1, 2           x = 3, 4, 5, 6       1/3   1/2
λ > 3/2             empty              all x                0     1

The admissible risk points are given in the diagram.

[Figure: admissible risk points; PROBLEM 8.2.3(a)]

(b) Reject with probability $a$ if $x = 1$ or 2, accept if $x = 3, 4, 5$ or 6, where
$a/3 = .1$, that is, $a = .3$. We have $1 - \beta = .3(1/2) = .15$, or $\beta = .85$.

(c) $\lambda = pc_1/(1 - p)c_2 = 3/2$, so reject with probability $a$ if $x = 1$ or 2, accept
if $x = 3, 4, 5, 6$, where $a$ is any number in [0, 1]. Thus there are uncountably
many Bayes solutions.


n> i3. 
. By Problem 2d, the test is of the form

$\varphi(x) = 1$ if $\sum_{k=1}^n x_k^2 > c$
$= 0$ if $\sum_{k=1}^n x_k^2 < c$
$=$ anything if $\sum_{k=1}^n x_k^2 = c$

Now if the true variance is $\theta$, $\sum_{k=1}^n (R_k^2)/\theta$ has the chi-square density $h_n$ with
$n$ degrees of freedom (see Problem 3, Section 5.2), hence

$P_\theta\left\{\sum_{k=1}^n R_k^2 > c\right\} = \int_{c/\theta}^\infty h_n(x)\,dx = H_n(c/\theta)$

(The numbers $H_n$ are tabulated in most statistics books.) Thus the error
probabilities are given by

$\alpha = H_n(c/\theta_0)$
$1 - \beta = H_n(c/\theta_1)$

For a given value of $n$, the specification of $\alpha$ determines $c$, which in turn
determines $\beta$. In practice, one must keep trying larger values of $n$ until $\beta$ is
reduced to the desired figure.


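. First consider $\theta = \theta_0$ versus $\theta = \theta_1$, $\theta_1 > \theta_0$.

$L(x) = f_{\theta_1}(x)/f_{\theta_0}(x) = (\theta_0/\theta_1)^n$ if $0 \le t(x) = \max x_i \le \theta_0$
$= \infty$ if $\theta_0 < t(x) \le \theta_1$
Let $\lambda = (\theta_0/\theta_1)^n$ and consider the following test
$\varphi(x) = 1$ if $L(x) > \lambda$, that is, if $\theta_0 < t(x) \le \theta_1$
$= 0$ if $L(x) < \lambda$ (this never occurs)
$= 1$ if $L(x) = \lambda$ and $t(x) \le \theta_0\alpha^{1/n}$, that is, if $0 \le t(x) \le \theta_0\alpha^{1/n}$
$= 0$ if $L(x) = \lambda$ and $t(x) > \theta_0\alpha^{1/n}$, that is, if $\theta_0\alpha^{1/n} < t(x) \le \theta_0$

Since $t(x)$ can never be $< 0$ or $> \theta_1$, $\varphi$ is exactly the test proposed in the
statement of the problem. Its type 1 error probability is

$P_{\theta_0}\{x: \max x_i \le \theta_0\alpha^{1/n}\} = (\theta_0\alpha^{1/n}/\theta_0)^n = \alpha$

Since $\varphi$ is a LRT, it is most powerful at level $\alpha$. But $\varphi$ does not depend on $\theta_1$,
hence is UMP for $\theta = \theta_0$ versus $\theta > \theta_0$.
Now let $\theta_1 < \theta_0$. Then

$L(x) = (\theta_0/\theta_1)^n$ if $0 \le t(x) \le \theta_1$
$= 0$ if $\theta_1 < t(x) \le \theta_0$
Let $\lambda = (\theta_0/\theta_1)^n$ and consider the following test.
$\varphi'(x) = 1$ if $L(x) > \lambda$ (this never occurs)
$= 0$ if $L(x) < \lambda$
$= 1$ if $L(x) = \lambda$ and $t(x) \le \theta_0\alpha^{1/n}$
$= 0$ if $L(x) = \lambda$ and $t(x) > \theta_0\alpha^{1/n}$

Since $t(x)$ cannot be $> \theta_0$ in this case, $\varphi' = \varphi$. Again, $\varphi$ is UMP for $\theta = \theta_0$
versus $\theta < \theta_0$ and the result follows. The power function is (see diagram)

$Q(\theta) = E_\theta\varphi = 1$ if $0 < \theta \le \theta_0\alpha^{1/n}$
$= (\theta_0\alpha^{1/n}/\theta)^n = \alpha(\theta_0/\theta)^n$, $\theta_0\alpha^{1/n} < \theta \le \theta_0$
$= 1 - P_\theta\{x: \theta_0\alpha^{1/n} < t(x) \le \theta_0\}$
$= 1 - [(\theta_0/\theta)^n - (\theta_0\alpha^{1/n}/\theta)^n]$
$= 1 - (1 - \alpha)(\theta_0/\theta)^n$, $\theta > \theta_0$

[Figure: power function $Q(\theta)$; PROBLEM 8.2.6]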
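The three branches of the power function can be checked against a direct computation, using the fact that $P_\theta\{\max R_i \le t\} = (t/\theta)^n$ for $0 \le t \le \theta$ when the observations are uniform on $(0, \theta)$. The following is an illustrative sketch; the function names and the sample values of $\theta_0$, $\alpha$, $n$ are assumptions made for the example.

```python
def power_closed_form(theta, theta0, alpha, n):
    """Q(theta) as given by the three-branch formula."""
    cutoff = theta0 * alpha ** (1.0 / n)
    if theta <= cutoff:
        return 1.0
    if theta <= theta0:
        return alpha * (theta0 / theta) ** n
    return 1.0 - (1.0 - alpha) * (theta0 / theta) ** n


def power_direct(theta, theta0, alpha, n):
    """P_theta{max <= theta0 * alpha**(1/n)} + P_theta{max > theta0}."""
    def cdf_max(t):
        # Distribution function of the maximum of n uniform(0, theta) variables.
        return min(1.0, (t / theta) ** n)

    cutoff = theta0 * alpha ** (1.0 / n)
    return cdf_max(cutoff) + (1.0 - cdf_max(theta0))


theta0, alpha, n = 2.0, 0.1, 5
for theta in (0.5, 1.0, 1.5, 2.0, 2.5, 4.0):
    print(theta, power_closed_form(theta, theta0, alpha, n), power_direct(theta, theta0, alpha, n))
```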


7. The risk set is {(a, 6):0 <a <1, (1 —«)2”" < 6B <1 — «2™”}, and the set 
of admissible risk points is {(«, (1 — «)2™):0 < a < I}. 


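10. If $\alpha(\varphi) = \alpha < \alpha_0$, let $\varphi' \equiv 1$ and $\varphi_t = (1 - t)\varphi + t\varphi'$, $0 \le t \le 1$. Then
$\alpha(\varphi_t) = (1 - t)\alpha(\varphi) + t\alpha(\varphi')$
$\beta(\varphi_t) = (1 - t)\beta(\varphi) + t\beta(\varphi')$
Since $\alpha(\varphi) < \alpha_0$, $\alpha(\varphi_t)$ will be $< \alpha_0$ for some $t \in (0, 1)$. But $\beta(\varphi') = 0$ and
$\beta(\varphi) > 0$ hence $\beta(\varphi_t) < \beta(\varphi)$, contradicting the assumption that $\varphi$ is most
powerful at level $\alpha_0$.

For the counterexample, let $R$ be uniformly distributed between $a$ and $b$,
and let $H_0$: $a = 0$, $b = 1$, $H_1$: $a = 2$, $b = 3$. Let $\varphi_t(x) = 1$ if $x > t$, $\varphi_t(x) = 0$
otherwise, where $0 \le t \le 1$. Then $\beta(\varphi_t) = 0$, $\alpha(\varphi_t) = 1 - t$. For $t > 0$, $\varphi_t$ is
most powerful at level $\alpha_0 = 1$, but is of size $< 1$.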


14. (a) 9, is Bayes with ce, = cg =1, p = 1/2 (hence A = 1), and L(@) = fo Olfa, (x) 
where f(x) = Tes , No(&,); the result follows. 
(b) Pp {v: gn(x) > 1} = Pa tgn(R) > 1} < Es, [en(R)1/2] by Chebyshev’s in- 
equality . 
— i Ey t(Ri) = [E, t(RpI” 
since all R; have the same density. 
(c) Var,, t(R,) = E, (Ry) —[E, t(Ry)P > 0 (assuming he, # he »» so Ey it( Ry < 
[E, (RDI. But 
2 he (&;) 
E,t (R,) = E4, [hg (Ry)/hg (Ry)I = hg, (&;) 5 (w,) dx; <1 
{xz:h g(a) > 0} "o 
Section 8.3 


1. (a) 6 = —n/>” _, Ing, (b) 6=2 (c) 6 = max (2,,..., Xp) 


vw kwon 


= |a| 


= r/x 


p(8) = O(1 — 0)/n 
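. By (8.3.2) with $g(\theta) = 1$, $0 \le \theta \le 1$, we have

$\psi(x) = \frac{\int_0^1 \theta\binom{n}{x}\theta^x(1 - \theta)^{n-x}\,d\theta}{\int_0^1 \binom{n}{x}\theta^x(1 - \theta)^{n-x}\,d\theta} = \frac{\beta(x + 2, n - x + 1)}{\beta(x + 1, n - x + 1)} = \frac{x + 1}{n + 2}$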


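6. For each $x$, we wish to minimize $\int_0^1 \binom{n}{x}\theta^x(1 - \theta)^{n-x}[(\theta - \psi(x))^2/\theta(1 - \theta)]\,d\theta$ [see
(8.3.1)]. In the same way that we derived (8.3.2), we find that

$\psi(x) = \frac{\int_0^1 \theta^x(1 - \theta)^{n-x-1}\,d\theta}{\int_0^1 \theta^{x-1}(1 - \theta)^{n-x-1}\,d\theta} = \frac{\beta(x + 1, n - x)}{\beta(x, n - x)} = \frac{x}{n}$

The risk function is

$\rho_\psi(\theta) = E_\theta\left[\frac{((R/n) - \theta)^2}{\theta(1 - \theta)}\right] = \frac{1}{n^2\theta(1 - \theta)}\mathrm{Var}_\theta R = \frac{1}{n}$ = constant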
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Section 8.4 


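1. $f_\theta(x_1, \dots, x_n) = (\theta_2 - \theta_1)^{-n} \prod_{i=1}^n I_{[\theta_1, \theta_2]}(x_i)$, where $I$ is an indicator function
$$= (\theta_2 - \theta_1)^{-n}\, I_{[\theta_1, \infty)}(\min x_i)\, I_{(-\infty, \theta_2]}(\max x_i)$$

Thus in (a), $(\min R_i, \max R_i)$ is sufficient; in (b), $\max R_i$ is sufficient; in (c),
$\min R_i$ is sufficient.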


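2. $$f_\theta(x_1, \dots, x_n) = \frac{\left(\prod_{i=1}^n x_i\right)^{\theta_1 - 1} e^{-\sum_{i=1}^n x_i/\theta_2}}{[\Gamma(\theta_1)\theta_2^{\theta_1}]^n}$$
hence if $\theta_1$, $\theta_2$ are both unknown, $\left(\prod_{i=1}^n R_i, \sum_{i=1}^n R_i\right)$ is sufficient; if $\theta_1$ is
known, $\sum_{i=1}^n R_i$ is sufficient; if $\theta_2$ is known, $\prod_{i=1}^n R_i$ is sufficient.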


3. $\left(\prod_{i=1}^n R_i, \prod_{i=1}^n (1 - R_i)\right)$ is sufficient if $\theta_1$ and $\theta_2$ are unknown; if $\theta_1$ is known,
$\prod_{i=1}^n (1 - R_i)$ is sufficient, and if $\theta_2$ is known, $\prod_{i=1}^n R_i$ is sufficient.


Section 8.5 
1. a — 1/n)?. 


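2. (a) $T$ has density $f_T(y) = ny^{n-1}/\theta^n$, $0 \le y \le \theta$ (Example 3, Section 2.8), so
$E_\theta g(T) = \int_0^\theta g(y) f_T(y)\,dy = (n/\theta^n) \int_0^\theta y^{n-1} g(y)\,dy$. If $E_\theta g(T) = 0$ for all
$\theta > 0$ then $y^{n-1}g(y) = 0$, hence $g(y) = 0$, for all $y$ (except on a set of
Lebesgue measure 0). Thus
$$P_\theta\{g(T) = 0\} = \int_{\{y : g(y) = 0\}} f_T(y)\,dy = 1$$

(b) If $g(T)$ is an unbiased estimate of $\gamma(\theta)$ then
$$E_\theta g(T) = \frac{n}{\theta^n} \int_0^\theta y^{n-1} g(y)\,dy = \gamma(\theta)$$

Assuming $g$ continuous we have
$$n\theta^{n-1} g(\theta) = \frac{d}{d\theta}[\theta^n \gamma(\theta)] \quad\text{or}\quad g(\theta) = \gamma(\theta) + \frac{\theta}{n}\gamma'(\theta)$$

Conversely, if $g$ satisfies this equation then $n\theta^{n-1}g(\theta) = d/d\theta[\theta^n\gamma(\theta)]$, hence
$n \int_0^\theta y^{n-1}g(y)\,dy = \theta^n\gamma(\theta)$, assuming $\theta^n\gamma(\theta) \to 0$ as $\theta \to 0$. Thus a UMVUE
of $\gamma(\theta)$ is given by $g(T) = \gamma(T) + (T/n)\gamma'(T)$. For example, if $\gamma(\theta) = \theta$
then $g(T) = T + T/n = [(n + 1)/n]T$; if $\gamma(\theta) = 1/\theta$ then $g(T) = 1/T
+ (T/n)(-1/T^2) = (1/T)[1 - (1/n)]$, assuming that $n > 1$.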
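
As a numerical sanity check of the formula $g(T) = \gamma(T) + (T/n)\gamma'(T)$, the expectation $(n/\theta^n)\int_0^\theta y^{n-1}g(y)\,dy$ can be approximated by a midpoint rule. This is a minimal sketch; the helper names `umvue` and `expected_value`, and the values $n = 5$, $\theta = 2$, are assumptions made for illustration.

```python
def umvue(gamma, dgamma, n):
    """Return g(t) = gamma(t) + (t/n) * gamma'(t)."""
    return lambda t: gamma(t) + (t / n) * dgamma(t)


def expected_value(g, n, theta, steps=100_000):
    """Midpoint-rule approximation of
    E_theta g(T) = (n / theta^n) * integral_0^theta y^(n-1) g(y) dy."""
    h = theta / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        total += y ** (n - 1) * g(y)
    return n / theta**n * total * h


n, theta = 5, 2.0
g_identity = umvue(lambda t: t, lambda t: 1.0, n)                # gamma(theta) = theta
g_reciprocal = umvue(lambda t: 1 / t, lambda t: -1 / t**2, n)    # gamma(theta) = 1/theta

print(expected_value(g_identity, n, theta))    # should be close to theta
print(expected_value(g_reciprocal, n, theta))  # should be close to 1/theta
```

The midpoint rule never evaluates the integrand at $y = 0$, so the example with $\gamma(\theta) = 1/\theta$ causes no division by zero.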
4. We have
$$P_N\{R_1 = x_1, \dots, R_n = x_n\} = \frac{1}{N^n} \prod_{i=1}^n I_{\{1, 2, \dots\}}(x_i)\; I_{\{1, \dots, N\}}(\max x_i)$$

hence $T = \max R_i$ is sufficient. Now
$$P_N\{T \le k\} = \left(\frac{k}{N}\right)^n, \quad k = 1, 2, \dots, N;$$

therefore
$$P_N\{T = k\} = \frac{k^n - (k - 1)^n}{N^n}, \quad k = 1, 2, \dots, N$$
Thus
$$E_N g(T) = \sum_{k=1}^N g(k)\left[\frac{k^n - (k - 1)^n}{N^n}\right]$$

If $E_N g(T) = 0$ for all $N = 1, 2, \dots$, take $N = 1$ to conclude that $g(1) = 0$.

If $g(k) = 0$ for $k = 1, \dots, N - 1$, then $E_N g(T) = 0$ implies that
$$g(N)\left[\frac{N^n - (N - 1)^n}{N^n}\right] = 0$$

hence $g(N) = 0$. By induction, $g \equiv 0$ and $T$ is complete. To find a UMVUE of
$\gamma(N)$, we must solve the equation
$$\sum_{k=1}^N g(k)[k^n - (k - 1)^n] = N^n\gamma(N), \quad N = 1, 2, \dots$$

or
$$g(N)[N^n - (N - 1)^n] = N^n\gamma(N) - (N - 1)^n\gamma(N - 1)$$

Thus
$$g(N) = \frac{N^n\gamma(N) - (N - 1)^n\gamma(N - 1)}{N^n - (N - 1)^n}, \quad N = 1, 2, \dots$$

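
The telescoping argument above can be confirmed exactly for a particular choice of $\gamma$. The sketch below takes $\gamma(N) = N$ as an illustrative case; the names `umvue_value` and `expectation` are hypothetical and not from the text.

```python
from fractions import Fraction


def umvue_value(k, n, gamma):
    """g(k) = [k^n gamma(k) - (k-1)^n gamma(k-1)] / [k^n - (k-1)^n]."""
    return Fraction(k**n * gamma(k) - (k - 1) ** n * gamma(k - 1),
                    k**n - (k - 1) ** n)


def expectation(N, n, gamma):
    """E_N g(T), using P_N{T = k} = [k^n - (k-1)^n] / N^n for k = 1, ..., N."""
    return sum(umvue_value(k, n, gamma) * Fraction(k**n - (k - 1) ** n, N**n)
               for k in range(1, N + 1))


# Illustrative case gamma(N) = N with sample size n = 3.
for N in range(1, 8):
    assert expectation(N, 3, lambda k: k) == N
```

For $n = 1$ and $\gamma(N) = N$ the formula reduces to $g(T) = 2T - 1$, which `umvue_value(k, 1, lambda k: k)` reproduces.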


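5. $R$ is clearly sufficient for itself, and
$$E_\theta g(R) = \sum_{k=r}^\infty g(k)\binom{k-1}{r-1}(1 - \theta)^r\theta^{k-r}$$
If $E_\theta g(R) = 0$ for all $\theta \in [0, 1)$ then $\sum_{k=r}^\infty g(k)\binom{k-1}{r-1}\theta^k = 0$, so that $g \equiv 0$.
Thus $R$ is complete. The above expression for $E_\theta g(R)$ shows that for a UMVUE
to exist, $\gamma(\theta)$ must be expandable in a power series. Conversely, let $\gamma(\theta) =
\sum_{i=0}^\infty a_i\theta^i$, $0 \le \theta < 1$. We must find $g$ such that
$$\sum_{k=r}^\infty g(k)\binom{k-1}{r-1}\theta^{k-r} = (1 - \theta)^{-r}\gamma(\theta) = \sum_{i=0}^\infty b_i\theta^i = \sum_{i=r}^\infty b_{i-r}\theta^{i-r}$$
Therefore
$$g(i) = \frac{b_{i-r}}{\binom{i-1}{r-1}}$$

For example, if $\gamma(\theta) = \theta^k$ then
$$(1 - \theta)^{-r}\gamma(\theta) = \theta^k \sum_{j=0}^\infty (-1)^j\binom{-r}{j}\theta^j = \sum_{j=0}^\infty \binom{j+r-1}{r-1}\theta^{j+k} \quad \text{(Problem 6a, Section 6.4)}$$

Thus $b_i = 0$ for $i < k$, and $b_i = \binom{i+r-1-k}{r-1}$ for $i \ge k$, so that
$$g(i) = \begin{cases} \binom{i-1-k}{r-1} \Big/ \binom{i-1}{r-1}, & i = r + k, r + k + 1, \dots \\ 0 & \text{otherwise} \end{cases}$$

In particular, if $k = 1$ then
$$g(i) = \binom{i-2}{r-1} \Big/ \binom{i-1}{r-1} = \frac{i - r}{i - 1}$$
Thus a UMVUE of $\theta = 1 - p$ is $(R - r)/(R - 1)$; a UMVUE of $p = 1 - \theta$ is
$1 - (R - r)/(R - 1) = (r - 1)/(R - 1)$. (The maximum likelihood estimate
of $p$ is $r/R$, which is biased; see Problem 3, Section 8.3.)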
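
The contrast between the unbiased estimate $(r - 1)/(R - 1)$ and the biased maximum likelihood estimate $r/R$ can be seen numerically. This sketch assumes $R$ is the number of trials needed to obtain $r$ successes, with success probability $p = 1 - \theta$, so that $P\{R = k\} = \binom{k-1}{r-1}p^r(1 - p)^{k-r}$ for $k \ge r$; the function names, the truncation at 2000 terms, and the values $r = 3$, $p = 0.3$ are illustrative choices.

```python
from math import comb


def neg_binomial_pmf(k, r, p):
    """P{R = k}: the r-th success occurs on trial k (assumed parametrization)."""
    return comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)


def expectation(g, r, p, terms=2000):
    """Truncated series for E g(R); the neglected tail is negligible here."""
    return sum(g(k) * neg_binomial_pmf(k, r, p) for k in range(r, r + terms))


r, p = 3, 0.3
unbiased = expectation(lambda k: (r - 1) / (k - 1), r, p)  # UMVUE of p
mle = expectation(lambda k: r / k, r, p)                   # maximum likelihood estimate

print(unbiased)  # close to p
print(mle)       # strictly larger than p, so r/R is biased upward
```

The upward bias of $r/R$ is consistent with Jensen's inequality, since $E(r/R) > r/E(R) = p$.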


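6. (a) $$E_\theta\psi(R) = \frac{e^{-\theta}}{1 - e^{-\theta}} \sum_{k=1}^\infty \psi(k)\frac{\theta^k}{k!} = e^{-\theta}$$

Thus
$$\sum_{k=1}^\infty \psi(k)\frac{\theta^k}{k!} = 1 - e^{-\theta} = \sum_{k=1}^\infty (-1)^{k-1}\frac{\theta^k}{k!}$$

The UMVUE is given by
$$\psi(k) = (-1)^{k-1} = \begin{cases} -1 & \text{if } k \text{ is even} \\ +1 & \text{if } k \text{ is odd} \end{cases}$$
(b) Since $e^{-\theta}$ is always $> 0$, the estimate found in part (a) looks rather silly.
If $\psi'(k) \equiv 1$ then $E_\theta\{[\psi'(R) - e^{-\theta}]^2\} < E_\theta\{[\psi(R) - e^{-\theta}]^2\}$ for all $\theta$, hence
$\psi$ is inadmissible.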
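
The inadmissibility claim in part (b) can be illustrated numerically. This sketch assumes that $R$ has the Poisson distribution conditioned on $R \ge 1$, that is, $P\{R = k\} = e^{-\theta}\theta^k/[k!(1 - e^{-\theta})]$ for $k = 1, 2, \dots$; that assumption, the function names, and the value $\theta = 1.5$ are illustrative and should be checked against the problem statement.

```python
from math import exp, factorial


def pmf(k, theta):
    """Assumed zero-truncated Poisson probability of R = k, k >= 1."""
    return exp(-theta) * theta**k / (factorial(k) * (1 - exp(-theta)))


def expect(f, theta, terms=100):
    """Truncated series for E_theta f(R)."""
    return sum(f(k) * pmf(k, theta) for k in range(1, terms + 1))


theta = 1.5
target = exp(-theta)

mean_umvue = expect(lambda k: (-1) ** (k - 1), theta)                   # psi(k) = (-1)^(k-1)
risk_umvue = expect(lambda k: ((-1) ** (k - 1) - target) ** 2, theta)   # quadratic risk of psi
risk_const = expect(lambda k: (1 - target) ** 2, theta)                 # quadratic risk of psi' = 1

print(mean_umvue, target)      # psi is unbiased for exp(-theta)
print(risk_umvue, risk_const)  # the constant estimate has smaller risk
```

The unbiased estimate takes only the values $\pm 1$, so on every even outcome its error $(1 + e^{-\theta})$ exceeds the error $(1 - e^{-\theta})$ of the constant estimate.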


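$$E_\theta\left[\left(\frac{2}{n}\sum_{i=1}^n R_i - \theta\right)^2\right] = \operatorname{Var}_\theta\left(\frac{2}{n}\sum_{i=1}^n R_i\right) = \frac{4}{n}\operatorname{Var}_\theta R_i = \frac{\theta^2}{3n}$$
$$E_\theta\left[\left(\frac{n+1}{n}T - \theta\right)^2\right] = \operatorname{Var}_\theta\left[\frac{n+1}{n}T\right] = \frac{(n+1)^2}{n^2}\operatorname{Var}_\theta T$$
Now $E_\theta T = n\theta/(n + 1)$ (see Problem 2)
and
$$E_\theta T^2 = \frac{n}{\theta^n}\int_0^\theta y^{n+1}\,dy = \frac{n\theta^2}{n+2}$$
Thus
$$\operatorname{Var}_\theta T = \frac{n\theta^2}{n+2} - \frac{n^2\theta^2}{(n+1)^2} = \frac{n\theta^2}{(n+1)^2(n+2)}$$
Therefore
$$E_\theta\left[\left(\frac{n+1}{n}T - \theta\right)^2\right] = \frac{(n+1)^2}{n^2} \cdot \frac{n\theta^2}{(n+1)^2(n+2)} = \frac{\theta^2}{n(n+2)} < \frac{\theta^2}{3n}$$
if $n > 1$.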
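In the inequality $[(a + b)/2]^2 \le (a^2 + b^2)/2$, set $a = \psi_1(R) - \gamma(\theta)$, $b =
\psi_2(R) - \gamma(\theta)$, to obtain
$$E_\theta\left\{\left[\frac{\psi_1(R) - \gamma(\theta)}{2} + \frac{\psi_2(R) - \gamma(\theta)}{2}\right]^2\right\} \le \tfrac{1}{2}[\rho_{\psi_1}(\theta) + \rho_{\psi_2}(\theta)] = \rho_{\psi_1}(\theta)$$

By minimality, we actually have equality. But the left side is
$$\tfrac{1}{4}\left[\rho_{\psi_1}(\theta) + \rho_{\psi_2}(\theta) + 2E_\theta\{[\psi_1(R) - \gamma(\theta)][\psi_2(R) - \gamma(\theta)]\}\right] = \tfrac{1}{2}\rho_{\psi_1}(\theta) + \tfrac{1}{2}\operatorname{Cov}_\theta[\psi_1(R), \psi_2(R)]$$
Thus
$$\operatorname{Cov}_\theta[\psi_1(R), \psi_2(R)] = \rho_{\psi_1}(\theta) = [\rho_{\psi_1}(\theta)\rho_{\psi_2}(\theta)]^{1/2} = [\operatorname{Var}_\theta\psi_1(R)\operatorname{Var}_\theta\psi_2(R)]^{1/2}$$

We therefore have equality in the Schwarz inequality, and it follows that (with
probability 1) one of the two random variables $\psi_1(R) - \gamma(\theta)$, $\psi_2(R) - \gamma(\theta)$ is a
multiple of the other (Problem 3, Section 3.4). The multiple is $+1$ or $-1$ since
$\psi_1(R)$ and $\psi_2(R)$ have the same variance. If the multiple is $+1$, we are finished,
and if it is $-1$, then $(\psi_1(R) + \psi_2(R))/2 = \gamma(\theta)$. The minimum variance is
therefore 0, hence $\psi_1(R) = \psi_2(R) = \gamma(\theta)$, as desired.

Section 8.6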


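1. $$P\left\{\frac{R_1}{R_2} \le z\right\} = \int_0^\infty \int_0^{zy} f_{12}(x, y)\,dx\,dy$$
$$= \frac{1}{2^{(m+n)/2}\,\Gamma(m/2)\Gamma(n/2)} \int_0^\infty y^{(n/2)-1}e^{-y/2} \int_0^{zy} x^{(m/2)-1}e^{-x/2}\,dx\,dy$$

But $\int_0^{zy} x^{(m/2)-1}e^{-x/2}\,dx =$ (with $x = uy$) $\int_0^z (uy)^{(m/2)-1}e^{-uy/2}\,y\,du$. Thus
$$P\left\{\frac{R_1}{R_2} \le z\right\} = \int_0^z h(x)\,dx$$

where
$$h(x) = \frac{x^{(m/2)-1}}{2^{(m+n)/2}\,\Gamma(m/2)\Gamma(n/2)} \int_0^\infty y^{[(m+n)/2]-1}e^{-(y/2)(1+x)}\,dy$$
$$= \frac{\Gamma[(m+n)/2]\,x^{(m/2)-1}\,2^{(m+n)/2}}{2^{(m+n)/2}\,\Gamma(m/2)\Gamma(n/2)\,(1+x)^{(m+n)/2}}$$
$$= \frac{1}{\beta(m/2, n/2)} \cdot \frac{x^{(m/2)-1}}{(1+x)^{(m+n)/2}}, \quad x \ge 0$$
If
$$W = \frac{R_1/m}{R_2/n} = \frac{nR_1}{mR_2} \quad\text{then}\quad f_W(w) = \frac{m}{n}\,h\!\left(\frac{m}{n}w\right) = f_{mn}(w), \text{ as desired}$$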


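2. If $R$ is chi-square with $n$ degrees of freedom then $R = R_1^2 + \cdots + R_n^2$ where
the $R_i$ are independent and normal (0, 1). Thus $E(R) = n$, and $\operatorname{Var} R = n \operatorname{Var}
R_1^2 = n[E(R_1^4) - [E(R_1^2)]^2] = n(3 - 1) = 2n$. If $R$ has the $t$ distribution with
$n$ degrees of freedom, then $E(R) = 0$ by symmetry, unless $n = 1$, in which case
$R$ has a Cauchy density and $E(R)$ does not exist. Now in the integral
$$\int_{-\infty}^\infty \frac{x^2}{(1 + x^2/n)^{(n+1)/2}}\,dx \quad\text{let}\quad y = \frac{1}{1 + (x^2/n)}$$

so that
$$dy = \frac{-2x/n}{[1 + (x^2/n)]^2}\,dx = \frac{-2xy^2}{n}\,dx$$
But
$$x = \sqrt{n}\sqrt{\frac{1-y}{y}} \quad (x > 0), \quad\text{hence}\quad dy = -\frac{2}{\sqrt{n}}\sqrt{\frac{1-y}{y}}\;y^2\,dx$$

The integral becomes
$$2\int_0^1 n\left(\frac{1-y}{y}\right) y^{(n+1)/2}\,\frac{\sqrt{n}}{2}\sqrt{\frac{y}{1-y}}\;\frac{1}{y^2}\,dy$$
$$= n^{3/2}\int_0^1 y^{(n/2)-2}(1 - y)^{1/2}\,dy = n^{3/2}\,\beta\!\left(\frac{n}{2} - 1, \frac{3}{2}\right)$$

Thus
$$\operatorname{Var} R = E(R^2) = \frac{\Gamma[(n+1)/2]}{\sqrt{n\pi}\,\Gamma(n/2)} \cdot \frac{n^{3/2}\,\Gamma[(n/2) - 1]\,\Gamma(3/2)}{\Gamma[(n+1)/2]} = \frac{n/2}{(n/2) - 1} = \frac{n}{n - 2}, \quad n > 2$$

If $n = 2$, the same calculation gives $\operatorname{Var} R = \infty$.

A similar calculation shows that if $R$ has the $F(m, n)$ distribution then $E(R) =
n/(n - 2)$ if $n > 2$, $E(R) = \infty$ if $n = 1$ or 2, $\operatorname{Var} R = [2n^2(m + n - 2)]/
[m(n - 2)^2(n - 4)]$ if $n > 4$, $\operatorname{Var} R = \infty$ if $n = 3$ or 4.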
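
The value $n/(n - 2)$ for the variance of the $t$ distribution can be checked numerically from the substituted integral $n^{3/2}\int_0^1 y^{(n/2)-2}(1 - y)^{1/2}\,dy$. This is a rough sketch using a midpoint rule; the function name `t_second_moment` and the step count are illustrative, and the approximation is only accurate for moderately large $n$ (for $n = 3$ the integrand is unbounded near 0 and convergence is slow).

```python
from math import gamma, pi, sqrt


def t_second_moment(n, steps=200_000):
    """Approximate E(R^2) for the t distribution with n > 2 degrees of freedom,
    via c * n^(3/2) * integral_0^1 y^(n/2 - 2) (1 - y)^(1/2) dy,
    where c is the normalizing constant of the t density."""
    c = gamma((n + 1) / 2) / (sqrt(n * pi) * gamma(n / 2))
    h = 1.0 / steps
    integral = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        integral += y ** (n / 2 - 2) * sqrt(1 - y)
    return c * n**1.5 * integral * h


for n in (5, 6, 10):
    print(n, t_second_moment(n), n / (n - 2))
```

The two printed columns should agree to several decimal places, matching the closed form obtained from the beta function.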
Absolutely continuous random variable, 53
Absolutely continuous random vector, 72 
Actions, set of, 242 

Admissible and inadmissible tests, 247 
Admissible risk points, 248 

Alternative, simple and composite, 243 
Average value, see Expectation 


Bayes estimate, 260 
with constant risk, 262 
with quadratic loss function, 260 
Bayes risk, 244, 260 
Bayes test, 244 
Bayes’ theorem, 36, 150 
Bernoulli distribution, see Distribution 
Bernoulli trials, 28, 38, 58, 128, 151, 175, 
177, 187, 190, 195, 207, 215 
generalized, 29 
see also Distribution, binomial 
Beta distribution, see Distribution 
Beta function, 133, 261 
Binomial distribution, see Distribution 
Boolean algebra, 3ff 
Borel-Cantelli lemma, 205 
second, 209 
Borel measurable function, 83 
Borel sets, 47, 50 
Bose-Einstein assumption, 21 


Cauchy distribution, see Distribution 
Central limit theorem, 169ff, 171 
Characteristic function(s), 154ff 
correspondence, theorem for, 156 
properties of, 166ff 
of a random vector, 279 
Chebyshev’s inequality, 126, 127, 129, 206, 
208 


Coin tossing, see Bernoulli trials; Distribution, 
binomial 
Combinatorial problems, 15ff 
fallacies in, 39ff 
multiple counting in, 22 
Complement of an event, 4 
Conditional density, 136, 148 
Conditional distribution function, 139, 140, 
148 
Conditional expectation, 140ff 
Conditional probability, 33ff, 130ff 
Conditional probability function, 98, 142 
Confidence coefficient, 276 
Confidence interval, 276 
Confidence set, 278 
Continuous random variable, 69 
Convergence, almost surely (almost every- 
where), 204—206, 208, 210 
in distribution, 170, 171, 175, 176 
in probability, 171, 175, 176, 205, 208, 
210 
Convex function, 262 
Convexity of the risk set, 248 
Convolution theorem, 164 
Correlation, 119ff 
Correlation coefficient, 120 
Covariance, 119 
Covariance function, 203 
Covariance matrix, 281 
Cylinder, 180 
measurable, 180


Decision function, 242, 243 
nonrandomized, 242 
Decision scheme, 151 
DeMorgan laws, 7, 9, 11
Density function(s), 53
conditional, 136, 148
joint, 70ff, 181 
marginal, 78 
Difference equation, 24, 39, 182, 186, 195 
characteristic equation of, 183 
Discrete probability space, 15 
Discrete random variables, 51, 95ff 
Disjoint events, 5 
Distribution, 95 
Bernoulli, 256, 264, 266, 269, 272 
beta, 260, 268 
binomial, 29, 32, 95, 97—99, 113, 122, 141, 
176, 256, 258, 260, 264, 268
Cauchy, 161, 166, 264 
chi-square, 165, 275, 276, 278 
exponential, 56, 65, 93, 110, 111, 113, 129, 
139, 150—152, 166, 168, 196, 200— 
202, 256, 263, 264 
F, 278 
gamma, 166, 267, 268 
geometric, 195, 196 
hypergeometric, 33, 256 
multidimensional Gaussian (joint Gaussian), 
279ff 
negative binomial, 196, 264, 268, 272 
normal (Gaussian), 87, 88, 92, 94, 108, 
113, 118, 124—126, 162, 165, 166, 
171, 173-176, 252, 256, 267, 268, 
271, 274—276, 278 
Poisson, 96—99, 114, 152, 163, 169, 197, 
198, 200, 202, 256, 264, 266, 268, 
270, 272 
t, 277, 278 
uniform, 54, 73, 76, 84, 92, 93, 113, 118, 
141, 149, 150—152, 165, 208, 257, 
263, 264, 267, 271, 272 
Distribution function(s), 52 
conditional, 139, 140, 148 
joint, 72 
properties of, 66ff 
Dominated convergence theorem, 231 


Essentially constant random variable, 85, 
115 
Estimate, 258 
Bayes, 260 
with constant risk, 262 
inadmissible, 272 
maximum likelihood, 258 
minimax, 262 
randomized, 258, 263
risk function of, 261
unbiased, 268 
uniformly minimum variance unbiased 
(UMVUE), 269 
Estimation, 152, 242, 243, 258ff 
Event(s), 2, 11 
algebra of, 3ff 
complement of, 4 
contracting sequence of, 67 
exhaustive, 35 
expanding sequence of, 66 
impossible, 3, 55 
independent, 26, 27 
intersection of, 4 
mutually exclusive (disjoint), 5 
union of, 4 
upper and lower limits of sequence of, 204, 
209 
sure (certain), 3 
Expectation, 100ff 
conditional, 140ff 
general definition of, 103 
properties of, 114ff 
Exponential distribution, see Distribution 


F distribution, see Distribution 
Factorization theorem, 266 
Fatou’s lemma, 230 
Fermi-Dirac assumption, 20 
Fourier series, 167 

Fourier transform, 155 


Gambler’s ruin problem, 182ff, 235 
Gamma distribution, see Distribution 
Gamma function, 109, 133 
Gaussian distribution, see Distribution, normal 
Generating function, 169, 191ff 
moments obtained from, 192 
Geometric distribution, see Distribution 


Hypergeometric distribution, see Distribution 
Hypothesis, 243ff 
a priori probability of, 244 
composite, 243 
null, 243 
simple, 243 
Hypothesis testing, 151, 242, 243ff 
fundamental theorem of, 246 
see also Test 


Independence, 25ff
Independence of sample mean and variance in
normal sampling, 274

Independent events, 26, 27

Independent random variables, 80 

Indicators, 122ff 

Intersection of events, 4 


Jensen’s inequality, 262 

Joint characteristic function, 279 

Joint density function, 70ff, 181 

Joint distribution function, 72 

Joint probability function, 76, 96, 180, 181 


Kolmogorov extension theorem, 180 


Laplace transform, 155 
properties of, 156, 157 
Lattice distribution, 169 
Law of large numbers, strong, 129, 203, 206, 
207 
weak, 128, 169, 171, 207 
Lebesgue integral, 114 
Level of a test, 246 
Liapounov condition, 175 
Likelihood ratio, 245 
test (LRT), 245 
Limit inferior (lower limit), 204, 209 
Limit superior (upper limit), 204, 209 
Linearly dependent random variables, 121, 
281 
Loss function (cost function), 242 
quadratic, 260 


Marginal densities, 78 
Markov chain(s), 211ff 
closed sets of, 224 
cyclically moving subclasses of, 227 
definition of , 214 
equivalence classes of states of, 223 
first entrance theorem for, 220 
initial distribution of, 213 
limiting probabilities of, 230ff 
state distribution of, 214 
state space of, 213 
states of, 220ff 
aperiodic, periodic, 229 
essential, 229 
mean recurrence time of, 230 
period of, 226—229 
recurrent (persistent), 221ff 
recurrent null, 233
recurrent positive, 233
transient, 221ff 
stationary distribution for, 236 
steady state distribution for, 215, 237 
stopping time for, 217 
strong Markov property of, 219 
transition matrix of, 214 
n-step, 214 
transition probabilities of, 214 
Maximum likelihood estimate, 258 
Maxwell-Boltzmann assumption, 20 
Median, 112 
Mean, see Expectation 
Minimax estimate, 262 
Minimax test, 250 
Moment-generating property of character- 
istic functions, 167, 168 
Moments, 107 
central, 108 
joint, 119 
obtained from generating functions, 192 
Multinomial probability function, 30, 98 
Mutually exclusive events, 5 


Negative binomial distribution, see Distri- 
bution 

Negative part of a random variable, 104 

Neyman-Pearson lemma, 246 

Normal distribution, see Distribution 


Observable, 242 
Order statistics, 91 


Partial fraction expansion, 159 
Poisson distribution, see Distribution 
Poisson random process, 196ff 
Poker, 19, 23, 40 
Positive part of a random variable, 104 
Power function of a test, 253 
Power of a test, 246 
Probability, 10ff 
a posteriori, 36 
classical definition of, 1, 13, 16 
conditional, 33ff 
frequency definition of, 2, 13 
Probability function, 51 
conditional, 98, 142 
joint, 76, 96, 180, 181 
Probability measure(s), 12 
consistent, 180 
Probability space, 12
discrete, 15


Queueing, 216 


Random process, 196 
Random telegraph signal, 203 
Random variable(s), 46ff 
absolutely continuous, 53 
central moments of, 108 
characteristic function of, 154ff 
classification of, 51ff 
continuous, 69 
definition of, 48, 50 
degenerate (essentially constant), 85, 115 
density function of, 53 
discrete, 51, 95ff 
functions of, 58ff, 84, 85ff, 94 
generating function of, 192ff 
independent, 80 
infinite sequences of, 178ff 
linearly dependent, 121, 281 
moments of, 107 
positive and negative parts of, 104 
probability function of, 51 
simple, 101 
Random vector, 72 
absolutely continuous, 72 
Random walk, 184ff 
combinatorial approach to, 186ff 
simple, 184 
with absorbing barriers, 184, 185, 215, 
228, 240 
with no barriers, 185, 186—191, 193- 
195, 215, 228, 240 
average length of time required to 
return to 0 in, 186, 191, 195 
distribution of first return to 0 in, 189 
first passage times in, 190 
probability of eventual return to 0 in, 
185 
with reflecting barriers, 229, 240 
Rao-Blackwell theorem, 263 
Recurrent (persistent) states of a Markov 
chain, 221 
Reflection principle, 188 
Renewal theorem, 235 
Risk function, 261 
Risk set, 248 


Sample mean, 259, 274 
Sample space, 2
Sample variance, 259, 274
Samples, 16ff 
ordered, with replacement, 16 
without replacement, 16 
unordered, with replacement, 18 
without replacement, 17 
Sampling from a normal population, 274 
Schwarz inequality, 119, 121, 207 
Sigma field, 11 
Simple random variable, 101 
Size of a test, 246 
Standard deviation, 108 
States of nature, 241 
Statistic, for a random variable, 265 
complete, 269 
sufficient, 265 
Statistical decision model, 241 
Statistics, 241ff 
Stirling’s formula, 43, 191 
Stochastic matrix, 212 
Stochastic process, 196 
Stopping times, 217ff 
Strong law of large numbers, 129, 203, 206, 
207 
Strong Markov property, 219 


t Distribution, see Distribution 
Test, 243 
acceptance region of, 278 
admissible and inadmissible, 247 
Bayes, 244 
level of, 246 
likelihood ratio (LRT), 245 
minimax, 250 
power of, 246 
power function of, 253 
rejection region (critical region) for, 243 
risk set of, 248 
size of, 246 
type 1 and type 2 errors of, 243 
uniformly most powerful (UMP), 253 
Total expectation, theorem of, 144, 149, 152, 
153 
Total probability, theorem of, 35, 90, 130, 
132, 134, 150, 182, 214 
Transient states of a Markov chain, 221 


Uniform distribution, see Distribution 
Uniformly most powerful (UMP) test, 253 
Union of events, 4 

Unit step function, 157


Variance, 108, 115–118
Venn diagrams, 4


Weak law of large numbers, 128, 169, 171, 207


